Abstract

We derive an asymptotic expansion for the excess risk (regret) of a weighted nearest-neighbour
classifier. This allows us to find the asymptotically optimal vector of non-negative weights,
which has a rather simple form. We show that the ratio of the regret of this classifier to that
of an unweighted $k$-nearest neighbour classifier depends asymptotically only on the dimension
$d$ of the feature vectors, and not on the underlying populations. The improvement is greatest
when $d = 4$, but thereafter decreases as $d \to \infty$. The popular bagged nearest neighbour
classifier can also be regarded as a weighted nearest neighbour classifier, and we show that its
corresponding weights are somewhat suboptimal when $d$ is small (in particular, worse than those
of the unweighted $k$-nearest neighbour classifier when $d = 1$), but are close to optimal when
$d$ is large. Finally, we argue that improvements in the rate of convergence are possible under
stronger smoothness assumptions, provided we allow negative weights.

Key words: Bagging, classification, nearest neighbours, weighted nearest neighbour clas- 
sifiers. 



1 Introduction 



Supervised classification, also known as pattern recognition, is a fundamental problem in Statis- 
tics, as it represents an abstraction of the decision-making problem faced by many applied 
practitioners. Examples include a doctor making a medical diagnosis, a handwriting expert 
performing an authorship analysis, or an email filter deciding whether or not a message is 
genuine. 

Classifiers based on nearest neighbours are perhaps the simplest and most intuitively appealing
of all nonparametric classifiers. The $k$-nearest neighbour classifier was originally studied
in the seminal works of Fix and Hodges (1951) (later republished as Fix and Hodges (1989))
and Cover and Hart (1967), but it retains its popularity today^1. Surprisingly, it is only
recently that detailed understanding of the nature of the error probabilities has emerged
(Hall et al., 2008).



Arguably the most obvious defect with the $k$-nearest neighbour classifier is that it places
equal weight on the class labels of each of the k nearest neighbours to the point x being classified. 
Intuitively, one would expect improvements in terms of the misclassification rate to be possible 
by putting decreasing weights on the class labels of the successively more distant neighbours. 
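
As an informal illustration of the weighted vote discussed in this paragraph, the following minimal sketch orders the training points by distance to the query and classifies to class 1 when the weighted proportion of class-1 labels is at least 1/2 (the rule defined formally in Section 2). The function names, the use of the Euclidean norm and the tie-breaking by stable sort are assumptions made for the example, not part of the paper.

```python
import numpy as np

def weighted_nn_classify(x, X_train, y_train, weights):
    """Classify x as class 1 or 2 by a weighted vote over distance-ordered neighbours.

    weights[i] is the weight given to the (i+1)-th nearest neighbour; the
    weights are assumed to be non-negative and to sum to one.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distances from each training point to x (an assumed choice of norm).
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    order = np.argsort(dists, kind="stable")  # nearest neighbour first
    vote_for_1 = np.sum(np.asarray(weights) * (y_train[order] == 1))
    return 1 if vote_for_1 >= 0.5 else 2

def knn_weights(n, k):
    """Unweighted k-nearest neighbour weights: 1/k on the first k neighbours."""
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return w
```

Passing `knn_weights(n, k)` recovers the usual unweighted rule; any other non-negative vector summing to one gives a different weighted nearest neighbour classifier with the same code.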

The first purpose of this paper is to describe the asymptotic structure of the difference 
between the misclassification rate (risk) of a weighted nearest neighbour classifier and that of 
the optimal Bayes classifier for classification problems with feature vectors in $\mathbb{R}^d$.
Theorem 1 in Section 2 below shows that, subject to certain regularity conditions on the
underlying distributions of each class and the weights, this excess risk (or regret)
asymptotically decomposes as
a sum of two dominant terms, one representing bias and the other representing variance. For 
simplicity of exposition, we will deal initially with binary classification problems, though we 
also indicate the appropriate extension to general multicategory problems. 

Our second contribution, following on from the first, is to derive the vector of non-negative
weights that is asymptotically optimal in the sense of minimising the misclassification rate (cf.
Theorem 2). In fact this asymptotically optimal weight vector has a relatively simple form: let
$n$ denote the sample size and let $w_{ni}$ denote the weight assigned to the $i$th nearest
neighbour (normalised so that $\sum_{i=1}^n w_{ni} = 1$). Then the optimal choice is to set
$k^* = \lfloor B^* n^{4/(d+4)} \rfloor$ (an explicit expression for $B^*$ is given in (2.4)
below) and then let

\[
w_{ni}^* =
\begin{cases}
\dfrac{1}{k^*}\Bigl[1 + \dfrac{d}{2} - \dfrac{d}{2(k^*)^{2/d}}\bigl\{i^{1+2/d} - (i-1)^{1+2/d}\bigr\}\Bigr] & \text{for } i = 1, \ldots, k^* \\
0 & \text{for } i = k^*+1, \ldots, n.
\end{cases}
\tag{1.1}
\]

Thus, in the asymptotically optimal weighting scheme, only a proportion $O(n^{-d/(d+4)})$ of the
weights are positive. The maximal weight is almost $(1 + d/2)$ times the average positive weight,
and the discrete distribution on $\{1, \ldots, n\}$ defined by the asymptotically optimal weights
decreases in a concave fashion when $d = 1$, in a linear fashion when $d = 2$ and in a convex
fashion when $d \geq 3$; see Figure 1. When $d$ is large, about $1/e$ of the weights are above
the average positive weight.
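
The weight profile just described can be computed directly. The sketch below assumes the form $w_{ni}^* = (k^*)^{-1}[1 + d/2 - d\{i^{1+2/d} - (i-1)^{1+2/d}\}/\{2(k^*)^{2/d}\}]$ for $i \le k^*$ and zero otherwise; the function name and the choice to pass $k^*$ directly (rather than compute it from the population constant $B^*$, which is unknown in practice) are ours.

```python
import numpy as np

def optimal_weights(n, d, k_star):
    """Weight vector of length n that is positive only on the first k_star neighbours."""
    i = np.arange(1, k_star + 1, dtype=float)
    # alpha_i = i^(1+2/d) - (i-1)^(1+2/d); these telescope to k_star^(1+2/d).
    alpha = i ** (1 + 2 / d) - (i - 1) ** (1 + 2 / d)
    w = np.zeros(n)
    w[:k_star] = (1 + d / 2 - d * alpha / (2 * k_star ** (2 / d))) / k_star
    return w
```

Because the $\alpha_i$ telescope, the weights sum to one exactly (up to floating point error), and for $d = 2$ the positive weights decrease linearly, in line with the description above.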

Another consequence of Theorem 2 is that $k^*$ is bigger by a factor of
$\{2(d+4)/(d+2)\}^{d/(d+4)}$ than the asymptotically optimal choice of $k$ for traditional,
unweighted $k$-nearest neighbour classification. It is notable that this factor, which is around
1.27 when $d = 1$ and increases towards 2 for large $d$, does not depend on the underlying
populations. This means that there is a natural correspondence between any unweighted
$k$-nearest neighbour classifier and one of optimally weighted form, obtained by multiplying $k$
by this dimension-dependent factor to obtain the number $k'$ of positive weights for the
weighted classifier, and then using the weights given in (1.1) with $k'$ replacing $k^*$.
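
The correspondence between $k$ and $k'$ amounts to one line of arithmetic. In the sketch below the factor is $\{2(d+4)/(d+2)\}^{d/(d+4)}$; rounding down to an integer is an assumption of the example, and the function names are illustrative.

```python
import math

def k_factor(d):
    """Dimension-dependent factor relating k' to an unweighted choice of k."""
    return (2 * (d + 4) / (d + 2)) ** (d / (d + 4))

def weighted_k_from_unweighted(k, d):
    """Number k' of positive weights matched to k (rounded down, by assumption)."""
    return math.floor(k_factor(d) * k)
```

For example, an unweighted classifier with $k = 100$ in dimension $d = 1$ corresponds to $k' = 127$ positive weights.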

^1 For instance, as a crude measure, over two-thirds of the nearly 1500 citations of the Cover and
Hart (1967) paper have occurred in the last ten years, according to the ISI Web of Knowledge.

Figure 1: Optimal weight profiles at different dimensions. Here, $k^* = 100$, and the figure
displays the positive weights in (1.1), scaled to have the same weight on the nearest neighbour
at each dimension.

In Corollary 3 we describe the asymptotic improvement in the excess risk that is attainable
using the procedure described in the previous paragraph. Since the rate of convergence to zero
of the excess risk is $O(n^{-4/(d+4)})$ in both cases, the improvement is in the leading constant,
and again it is notable that the asymptotic improvement does not depend on the underlying
populations. The improvement is relatively modest, which goes some way to explaining the
continued popularity of the (unweighted) $k$-nearest neighbour classifier. Nevertheless, for
$d \leq 15$, the improvement in regret is at least 5%, though it is negligible as $d \to \infty$;
the greatest improvement occurs when $d = 4$, and here it is just over 8%. See Figure 2.
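
The figures quoted in this paragraph can be checked numerically. The closed form used below, $2^{4/(d+4)}\{(d+2)/(d+4)\}^{(2d+4)/(d+4)}$, is our own simplification of the limiting regret ratio obtained from the leading variance and squared bias terms; it should be read as an assumption of the sketch, chosen because it reproduces the values stated in the text.

```python
def regret_ratio(d):
    """Limiting regret ratio: optimally weighted classifier over optimal unweighted k-NN."""
    return 2 ** (4 / (d + 4)) * ((d + 2) / (d + 4)) ** ((2 * d + 4) / (d + 4))

# Tabulate the ratio for small dimensions.
for d in (1, 2, 4, 15, 16):
    print(d, round(regret_ratio(d), 4))
```

The ratio is minimised at $d = 4$, stays below 0.95 up to $d = 15$, and tends to 1 as $d$ grows.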



Figure 2: Asymptotic ratio of the regret of the optimally weighted nearest neighbour classifier
to that of the optimal $k$-nearest neighbour classifier, as a function of the dimension $d$ of the
feature vectors.

Another popular way of improving the performance of a classifier is by bagging (Breiman, 1996,
1999). Short for 'bootstrap aggregating', bagging involves combining the results of many
empirically simulated predictions. Empirical analyses, e.g. Steele (2009), have reported that
bagging can result in improvements over unweighted $k$-nearest neighbour classification.
Moreover, as explained by Biau, Cérou and Guyader (2010), understanding the properties of the
bagged nearest neighbour classifier is also of interest because they provide insight into random
forests (Breiman, 2001). Random forest algorithms have been some of the most successful
ensemble methods for regression and classification problems, but their theoretical properties
remain relatively poorly understood. When bagging the nearest neighbour classifier, we can
draw resamples from the data either with or without replacement. We treat the 'infinite
simulation' case, where both versions take the form of a weighted nearest neighbour classifier
with weights decaying approximately exponentially on successively more distant observations
from the point being classified (Hall and Samworth, 2005; Biau, Cérou and Guyader, 2010).
The crucial choice is that of the resample size, or equivalently the sampling fraction, i.e. the
ratio of the resample size to the original sample size. In Section 3 we describe the
asymptotically optimal resample fraction (showing in particular that it is the same for both
with- and without-replacement sampling) and compare its regret with those of the weighted and
unweighted $k$-nearest neighbour classifiers.

Finally, in Section 4 we consider the problem of choosing optimal weights without the
restriction that they should be non-negative. The situation here is somewhat analogous to the
use of higher order kernels for classifiers based on kernel density estimates of each of the
population densities. In particular, subject to additional smoothness assumptions on the
population densities, we find that powers of $n$ arbitrarily close to the 'parametric rate' of
$O(n^{-1})$ for the excess risk are attainable. All proofs are deferred to the Appendix.

Classification has been the subject of several book-length treatments, including Hand (1981),
Devroye, Györfi and Lugosi (1996) and Gordon (1999). In particular, classifiers based on
nearest neighbours form a central theme of Devroye, Györfi and Lugosi (1996). The review paper
by Boucheron, Bousquet and Lugosi (2005) contains 243 references and provides a thorough survey
of the classification literature up to 2005. More recently, Audibert and Tsybakov (2007) have
discussed the relative merits of plug-in classifiers (a family to which weighted nearest
neighbour classifiers belong) and classifiers based on empirical risk minimisation, such as
support vector machines (Cortes and Vapnik, 1995; Blanchard, Bousquet and Massart, 2008;
Steinwart and Christmann, 2008).



Weighted nearest neighbour classifiers were first studied by Royall (1966); see also Bailey and
Jain (1978). Stone (1977) proved that if $\max_{1 \leq i \leq n} w_{ni} \to 0$ as $n \to \infty$
and $\sum_{i=1}^k w_{ni} \to 1$ for some $k = k_n$ with $k/n \to 0$ as $n \to \infty$, then the
risk of the weighted nearest neighbour classifier converges to the risk of the Bayes classifier;
see also Devroye, Györfi and Lugosi (1996, p. 179). As mentioned above, this work attempts to
study the difference between these risks more closely. Weighted nearest neighbour classifiers
are also related to classifiers based on kernel estimates of each of the class densities; see
for example the review by Raudys and Young (2004), as well as Hall and Kang (2005). The
$O(n^{-4/(d+4)})$ rates of convergence obtained in this paper for non-negative weights are the
same as those obtained by Hall and Kang (2005) under similar twice-differentiable conditions
with second-order kernel estimators of the class densities. Marron (1983) proved that in a
certain sense this is the minimax optimal rate, though his assumptions and context are slightly
different from what is studied here. Further related work includes the literature on highest
density region or level set estimation (Polonik, 1995; Rigollet and Vert, 2009; Samworth and
Wand, 2010).



Hall and Samworth (2005) and Biau and Devroye (2010) proved an analogous result for the bagged
nearest neighbour classifier to the Stone (1977) result described in the previous paragraph.
More precisely, if the resample size $m = m_n$ used for the bagging diverges to infinity, and
$m/n \to 0$ as $n \to \infty$, then the risk of the bagged nearest neighbour classifier converges
to the Bayes risk. Note that this result does not depend on whether the resamples are taken with
or without replacement from the training data. Biau, Cérou and Guyader (2010) have recently
proved a striking rate of convergence result for the bagged nearest neighbour estimate; this is
described in greater detail in Section 3.



2 Main results 

Let $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots$ be independent and identically distributed pairs
taking values in $\mathbb{R}^d \times \{1, 2\}$. We suppose that
$\mathbb{P}(Y = 1) = \pi = 1 - \mathbb{P}(Y = 2)$ for some $\pi \in (0, 1)$ and that
$(X \mid Y = r) \sim P_r$ for $r = 1, 2$, where $P_r$ is a probability measure on $\mathbb{R}^d$.
We write $\bar{P} = \pi P_1 + (1 - \pi) P_2$ for the marginal distribution of $X$ and let
$\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$ denote the corresponding regression function.

A classifier $C$ is a Borel measurable function from $\mathbb{R}^d$ to $\{1, 2\}$, with the
interpretation that the point $x \in \mathbb{R}^d$ is classified as belonging to class $C(x)$.
The misclassification rate, or risk, of $C$ over a Borel measurable set
$\mathcal{R} \subseteq \mathbb{R}^d$ is defined to be

\[ R_{\mathcal{R}}(C) = \mathbb{P}[\{C(X) \neq Y\} \cap \{X \in \mathcal{R}\}]. \]

The classifier which minimises the risk over $\mathcal{R}$ is the Bayes classifier, given by

\[
C^{\mathrm{Bayes}}(x) =
\begin{cases}
1 & \text{if } \eta(x) \geq 1/2 \\
2 & \text{otherwise.}
\end{cases}
\]

Its risk is

\[ R_{\mathcal{R}}(C^{\mathrm{Bayes}}) = \int_{\mathcal{R}} \min\{\eta(x), 1 - \eta(x)\} \, d\bar{P}(x). \]



For each $n \in \mathbb{N}$, let $\mathbf{w}_n = (w_{ni})_{i=1}^n$ denote a vector of weights,
normalised so that $\sum_{i=1}^n w_{ni} = 1$. Fix $x \in \mathcal{R}$ and an arbitrary norm
$\|\cdot\|$ on $\mathbb{R}^d$, and let $(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$ denote a
permutation of the training sample $\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ such that
$\|X_{(1)} - x\| \leq \ldots \leq \|X_{(n)} - x\|$. We define the weighted nearest neighbour
classifier to be

\[
\hat{C}_n^{\mathrm{wnn}}(x) =
\begin{cases}
1 & \text{if } \sum_{i=1}^n w_{ni} 1_{\{Y_{(i)} = 1\}} \geq 1/2 \\
2 & \text{otherwise.}
\end{cases}
\]

We also write $\hat{C}_{n,\mathbf{w}_n}^{\mathrm{wnn}}$ where it is necessary to emphasise the
weight vector, for example when comparing different weighted nearest neighbour classifiers. Our
initial goal is to study the asymptotic behaviour of

\[ R_{\mathcal{R}}(\hat{C}_n^{\mathrm{wnn}}) = \mathbb{P}[\{\hat{C}_n^{\mathrm{wnn}}(X) \neq Y\} \cap \{X \in \mathcal{R}\}], \]

where the probability is taken over the joint distribution of $(X, Y)$ and $\mathcal{D}_n$.

It will be convenient to define a little notation: for a smooth function
$g : \mathbb{R}^d \to \mathbb{R}$, we write $\dot{g}(x)$ for its derivative at $x$, and $g_j(x)$
for its $j$th partial derivative at $x$. Analogously, we write $\ddot{g}(x)$ for the second
derivative of $g$ at $x$, and $g_{jk}(x)$ for the $(j, k)$th element of the corresponding
Hessian matrix at $x$. We let $B_\delta(x) = \{y \in \mathbb{R}^d : \|y - x\| \leq \delta\}$
denote the closed ball of radius $\delta$ centered at $x$ in the norm $\|\cdot\|$, and let $a_d$
denote the $d$-dimensional Lebesgue measure of the unit ball $B_1(x)$. We will make use of the
following assumptions for our theoretical results:

(A.1) The set $\mathcal{R} \subseteq \mathbb{R}^d$ is a compact $d$-dimensional manifold with
boundary $\partial\mathcal{R}$.

(A.2) The set $\mathcal{S} = \{x \in \mathcal{R} : \eta(x) = 1/2\}$ is non-empty. There exists an
open subset $U_0$ of $\mathbb{R}^d$ that contains $\mathcal{S}$ and such that the following
properties hold: firstly, $|\eta(x) - 1/2|$ is bounded away from zero for $x \in U \setminus U_0$,
where $U$ is an open set containing $\mathcal{R}$; secondly, the restrictions of $P_1$ and $P_2$
to $U_0$ are absolutely continuous with respect to Lebesgue measure, with twice continuously
differentiable Radon-Nikodym derivatives $f_1$ and $f_2$ respectively.

(A.3) There exists $\rho > 0$ such that $\int_{\mathbb{R}^d} \|x\|^\rho \, d\bar{P}(x) < \infty$.
Moreover, for sufficiently small $\delta > 0$, the ratio
$\bar{P}(B_\delta(x)) / (a_d \delta^d)$ is bounded away from zero, uniformly for
$x \in \mathcal{R}$.

(A.4) For all $x \in \mathcal{S}$, we have $\dot{\eta}(x) \neq 0$, and for all
$x \in \mathcal{S} \cap \partial\mathcal{R}$, we have $\dot{\partial\eta}(x) \neq 0$, where
$\partial\eta$ denotes the restriction of $\eta$ to $\partial\mathcal{R}$.
The introduction of the compact set $\mathcal{R}$ finesses the problem of performing
classification in the tails of the feature vector distributions. See for example Hall and Kang
(2005, Section 3) for further discussion of this point and related results, as well as Chanda
and Ruymgaart (1989). Mammen and Tsybakov (1999) and Audibert and Tsybakov (2007) impose similar
compactness assumptions for their results. The set $\mathcal{R}$ may be arbitrarily large, though
the larger it is, the stronger are the requirements in (A.2). Although as stated, the
assumptions on $\mathcal{R}$ are quite general, little is lost by thinking of $\mathcal{R}$ as a
large closed Euclidean ball. Its role in the asymptotic expansion of Theorem 1 below is that it
is involved in the definition of the set $\mathcal{S}$, which represents the decision boundary
of the Bayes classifier. We will see that the behaviour of $f_1$ and $f_2$ on the set
$\mathcal{S}$ is crucial for determining the asymptotic behaviour of weighted nearest neighbour
classifiers.

The second part of (A.3) asks that the ratio of the $\bar{P}$-measure of small balls to the
corresponding $d$-dimensional Lebesgue measure is bounded away from zero. This requirement is
satisfied, for instance, if $P_1$ and $P_2$ are absolutely continuous with respect to Lebesgue
measure, with Radon-Nikodym derivatives that are bounded away from zero on the open set $U$.

The assumption in (A.4) that $\dot{\eta}(x) \neq 0$ for $x \in \mathcal{S}$ asks that $f_1$ and
$f_2$, weighted by the respective prior probabilities of each class, should cut at a non-zero
angle along $\mathcal{S}$. In the language of differential topology, this means that 1/2 is a
regular value of the function $\eta$, and the second part of (A.4) asks for 1/2 to be a regular
value of the restriction of $\eta$ to $\partial\mathcal{R}$. Together, these two requirements
ensure that $\mathcal{S}$ is a $(d-1)$-dimensional submanifold with boundary of $\mathbb{R}^d$,
and the boundary of $\mathcal{S}$ is $\{x \in \partial\mathcal{R} : \eta(x) = 1/2\}$ (Guillemin
and Pollack, 1974, p. 60).

The requirement in (A.4) that $\dot{\eta}(x) \neq 0$ for $x \in \mathcal{S}$ is related to the
well-known margin condition of, e.g. Mammen and Tsybakov (1999) and Tsybakov (2004): when it
holds (and in the presence of the other conditions), there exist $c, C > 0$ such that

\[ c\epsilon \leq \mathbb{P}(|\eta(X) - 1/2| \leq \epsilon \cap X \in \mathcal{R}) \leq C\epsilon \tag{2.1} \]

for sufficiently small $\epsilon > 0$; see Tsybakov (2004, Proposition 1). A proof of this fact,
which uses Weyl's tube formula (Gray, 2004), is given after the proof of Theorem 1 in the
Appendix. In this sense, we work in the setting of a margin condition with the power parameter
equal to 1.
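We now introduce some notation needed for Theorem 1 below. For $\beta > 0$, let $W_{n,\beta}$
denote the set of all sequences of non-negative deterministic weight vectors
$\mathbf{w}_n = (w_{ni})_{i=1}^n$ satisfying

• $\sum_{i=1}^n w_{ni}^2 \leq n^{-\beta}$;

• $n^{-4/d} \bigl(\sum_{i=1}^n \alpha_i w_{ni}\bigr)^2 \leq n^{-\beta}$, where
  $\alpha_i = i^{1+2/d} - (i-1)^{1+2/d}$; note that this latter expression appears in (1.1);

• $n^{2/d} \sum_{i=k_2+1}^n w_{ni} \big/ \sum_{i=1}^n \alpha_i w_{ni} \leq 1/\log n$, where
  $k_2 = \lfloor n^{1-\beta} \rfloor$;

• $\sum_{i=k_2+1}^n w_{ni}^2 \big/ \sum_{i=1}^n w_{ni}^2 \leq 1/\log n$;

• $\sum_{i=1}^n w_{ni}^3 \big/ \bigl(\sum_{i=1}^n w_{ni}^2\bigr)^{3/2} \leq 1/\log n$.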
Observe that $W_{n,\beta_1} \supseteq W_{n,\beta_2}$ for $\beta_1 < \beta_2$. The first and last
conditions ensure that the weights are not too concentrated on a small number of points; the
second amounts to a mild moment condition on the probability distribution on
$\{1, \ldots, n\}$ defined by the weights. The next two conditions ensure that not too much
weight (or squared weight in the case of the latter condition) is assigned to observations that
are too far from the point being classified. Although there are many requirements on the weight
vectors, they are rather mild conditions when $\beta$ is small, as can be seen by considering
the limiting case $\beta = 0$. For instance, for the unweighted $k$-nearest neighbour classifier
with weights $\mathbf{w}_n = (w_{ni})_{i=1}^n$ given by $w_{ni} = k^{-1} 1_{\{1 \leq i \leq k\}}$,
we have that $\mathbf{w}_n \in W_{n,\beta}$ for small $\beta > 0$ provided that
$\max(n^\beta, \log^2 n) \leq k \leq \min(n^{1-\beta d/4}, n^{1-\beta})$. Thus for the vector of
$k$-nearest neighbour weights to belong to $W_{n,\beta}$ for all large $n$, it is necessary that
the usual conditions $k \to \infty$ and $k/n \to 0$ for consistency are satisfied, and these
conditions are almost sufficient when $\beta > 0$ is small. The situation is similar for the
bagged nearest neighbour classifier; see Section 3 below.

The fact that the weights are assumed to be deterministic means that they depend only 
on the ordering of the distances, not the raw distances themselves (as would be the case for a 
classifier based on kernel density estimates of the population densities). Such classifiers are
not necessarily straightforward to implement, however: Hall and Kang (2005) showed that even in
the simple situation where $d = 1$ and $\pi f_1$ and $(1 - \pi) f_2$ cross at a single point
$x_0$, the optimal order of the bandwidth for the kernel depends on the sign of
$\ddot{f}_1(x_0) \ddot{f}_2(x_0)$.
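Continuing with our notational definitions, let $\bar{f} = \pi f_1 + (1 - \pi) f_2$, and let

\[
a(x) = \frac{\dot{\bar{f}}(x)^T \dot{\eta}(x) + \tfrac{1}{2} \bar{f}(x) \Delta\eta(x)}{(d+2) a_d^{2/d} \bar{f}(x)^{1+2/d}},
\tag{2.2}
\]

where $\Delta\eta(x) = \sum_{j=1}^d \eta_{jj}(x)$ denotes the Laplacian of $\eta$ at $x$. Define

\[
B_1 = \int_{\mathcal{S}} \frac{\bar{f}(x_0)}{4\|\dot{\eta}(x_0)\|} \, d\mathrm{Vol}^{d-1}(x_0),
\qquad
B_2 = \int_{\mathcal{S}} \frac{\bar{f}(x_0)}{\|\dot{\eta}(x_0)\|} a(x_0)^2 \, d\mathrm{Vol}^{d-1}(x_0),
\tag{2.3}
\]

where $\mathrm{Vol}^{d-1}$ denotes the natural $(d-1)$-dimensional volume measure that
$\mathcal{S}$ inherits as a subset of $\mathbb{R}^d$. Note that $B_1 > 0$, and $B_2 \geq 0$,
with equality if and only if $a$ is identically zero on $\mathcal{S}$. Although the definitions
of $B_1$ and $B_2$ are complicated, we will see after the statement of Theorem 1 below that they
are comprised of terms that have natural interpretations.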

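Theorem 1. Assume (A.1), (A.2), (A.3) and (A.4). Then for each $\beta > 0$,

\[ R_{\mathcal{R}}(\hat{C}_n^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}}) = \gamma_n(\mathbf{w}_n)\{1 + o(1)\}, \]

as $n \to \infty$, uniformly for $\mathbf{w}_n \in W_{n,\beta}$, where

\[ \gamma_n(\mathbf{w}_n) = B_1 \sum_{i=1}^n w_{ni}^2 + B_2 \Bigl(\sum_{i=1}^n \frac{\alpha_i w_{ni}}{n^{2/d}}\Bigr)^2. \]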
i=l ^i=l ' 
Theorem 1 tells us that, asymptotically, the dominant contribution to the excess risk, or
regret, of the weighted nearest neighbour classifier over $\mathcal{R}$, can be decomposed as a
sum of two terms. The two terms, constant multiples of $\sum_{i=1}^n w_{ni}^2$ and
$\bigl(\sum_{i=1}^n \alpha_i w_{ni} / n^{2/d}\bigr)^2$ respectively, represent variance and
squared bias contributions to the regret. It is interesting to observe that, although the 0-1
classification loss function is quite different from the squared error loss often used in
regression problems, we nevertheless obtain such an asymptotic decomposition.
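
To see the two terms at work, the leading-order regret can be evaluated for any weight vector once values of the constants are supplied. In the sketch below the function name and the values $B_1 = B_2 = 1$ are purely hypothetical (in practice both constants depend on the unknown populations), and $\alpha_i = i^{1+2/d} - (i-1)^{1+2/d}$.

```python
import numpy as np

def gamma_n(w, d, B1, B2):
    """Leading-order regret: B1 * sum(w^2) + B2 * (sum(alpha * w) / n^(2/d))^2."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    i = np.arange(1, n + 1, dtype=float)
    alpha = i ** (1 + 2 / d) - (i - 1) ** (1 + 2 / d)
    variance = B1 * np.sum(w ** 2)
    squared_bias = B2 * (np.sum(alpha * w) / n ** (2 / d)) ** 2
    return variance + squared_bias

# Unweighted k-NN with n = 10**4, k = 100, d = 4 and hypothetical B1 = B2 = 1.
n, k, d = 10**4, 100, 4
w_knn = np.zeros(n)
w_knn[:k] = 1.0 / k
print(gamma_n(w_knn, d, 1.0, 1.0))
```

For unweighted weights the two terms reduce to $B_1/k$ and $B_2 (k/n)^{4/d}$, so increasing $k$ lowers the variance term and raises the squared bias term, which is the trade-off that the optimal weights balance.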

The constant multiples of the dominant variance and squared bias terms depend only on the
behaviour of $f_1$ and $f_2$ (and their first and second derivatives) on $\mathcal{S}$, as seen
from (2.3). Moreover, we can see from the expression for $B_1$ in (2.3) that the contribution to
the dominant variance term in the regret will tend to be large in the following three
situations: firstly, when $\bar{f}(\cdot)$ is large on $\mathcal{S}$; secondly when the
$\mathrm{Vol}^{d-1}$ measure of $\mathcal{S}$ is large; and thirdly when
$\|\dot{\eta}(\cdot)\|$ is small on $\mathcal{S}$. In the first two of these situations, the
probability is relatively high that a point to be classified will be close to the Bayes decision
boundary $\mathcal{S}$, where classification is difficult. In the latter case, the regression
function $\eta$ moves away from 1/2 only slowly as we move away from $\mathcal{S}$, meaning that
there is a relatively large region of points near $\mathcal{S}$ where classification is
difficult. From the expression for $B_2$ in (2.3), we see that the dominant squared bias term is
also large in these situations, and also when $a(\cdot)^2$ is large on $\mathcal{S}$. From the
proof of Theorem 1, it is apparent that $a(x) \sum_{i=1}^n \alpha_i w_{ni} / n^{2/d}$ is the
dominant bias term for $S_n(x) = \sum_{i=1}^n w_{ni} 1_{\{Y_{(i)} = 1\}}$ as an estimator of
$\eta(x)$. Indeed, by a Taylor expansion,

\[
\mathbb{E}\{S_n(x)\} - \eta(x) = \sum_{i=1}^n w_{ni} \mathbb{E}\eta(X_{(i)}) - \eta(x)
\approx \sum_{i=1}^n w_{ni} \mathbb{E}\{(X_{(i)} - x)^T \dot{\eta}(x)\}
+ \frac{1}{2} \sum_{i=1}^n w_{ni} \mathbb{E}\{(X_{(i)} - x)^T \ddot{\eta}(x) (X_{(i)} - x)\}.
\]

The two summands in the definition of $a(x)$ represent asymptotic approximations to the
respective summands in this approximation.

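Consider now the problem of optimising the choice of weight vectors. Let

\[
k^* = \Bigl\lfloor \Bigl\{\frac{d(d+4)}{2(d+2)}\Bigr\}^{d/(d+4)} \Bigl(\frac{B_1}{B_2}\Bigr)^{d/(d+4)} n^{4/(d+4)} \Bigr\rfloor,
\tag{2.4}
\]

and then define the weights $\mathbf{w}_n^* = (w_{ni}^*)_{i=1}^n$ as in (1.1). The first part of
Theorem 2 below can be regarded as saying that the weights $\mathbf{w}_n^*$ are asymptotically
optimal.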

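Theorem 2. Assume (A.1)-(A.4) and assume also that $B_2 > 0$. For any $\beta > 0$ and any
sequence $\mathbf{w}_n = (w_{ni})_{i=1}^n \in W_{n,\beta}$, we have

\[
\liminf_{n \to \infty} \frac{R_{\mathcal{R}}(\hat{C}_{n,\mathbf{w}_n}^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}_{n,\mathbf{w}_n^*}^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})} \geq 1.
\tag{2.5}
\]

Moreover, the ratio in (2.5) above converges to 1 if and only if both
$\sum_{i=1}^n w_{ni}^2 \big/ \sum_{i=1}^n (w_{ni}^*)^2 \to 1$ and
$\sum_{i=1}^n \alpha_i w_{ni} \big/ \sum_{i=1}^n \alpha_i w_{ni}^* \to 1$. Equivalently, this
occurs if and only if both

\[
n^{4/(d+4)} \sum_{i=1}^n \{w_{ni}^2 - (w_{ni}^*)^2\} \to 0
\quad \text{and} \quad
n^{-8/\{d(d+4)\}} \sum_{i=1}^n \alpha_i (w_{ni} - w_{ni}^*) \to 0.
\tag{2.6}
\]

Finally,

\[
n^{4/(d+4)} \{R_{\mathcal{R}}(\hat{C}_{n,\mathbf{w}_n^*}^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})\}
\to \frac{(d+2)^{(2d+4)/(d+4)}}{2^{4/(d+4)}} \Bigl\{\frac{1}{d(d+4)}\Bigr\}^{d/(d+4)} B_1^{4/(d+4)} B_2^{d/(d+4)}.
\tag{2.7}
\]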

Now write $\hat{C}_{n,k}^{\mathrm{knn}}$ for the traditional, unweighted $k$-nearest neighbour
classifier (or equivalently, the weighted nearest neighbour classifier with $w_{ni} = 1/k$ for
$i = 1, \ldots, k$ and $w_{ni} = 0$ otherwise). Another consequence of Theorem 1 is that,
provided (A.1)-(A.4) hold and $B_2 > 0$, the quantity $k^*$ defined in (2.4) is larger by a
factor of $\{2(d+4)/(d+2)\}^{d/(d+4)}$ (up to an unimportant rounding error) than the
asymptotically optimal choice $k^{\mathrm{opt}}$ for $\hat{C}_{n,k}^{\mathrm{knn}}$; see also
Hall et al. (2008). We can therefore compare the performance of
$\hat{C}_{n,k^{\mathrm{opt}}}^{\mathrm{knn}}$ with that of
$\hat{C}_{n,\mathbf{w}_n^*}^{\mathrm{wnn}}$.

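Corollary 3. Assume (A.1)-(A.4) and assume also that $B_2 > 0$. Then

\[
\frac{R_{\mathcal{R}}(\hat{C}_{n,\mathbf{w}_n^*}^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}_{n,k^{\mathrm{opt}}}^{\mathrm{knn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}
\to \frac{1}{4^{d/(d+4)}} \Bigl(\frac{2d+4}{d+4}\Bigr)^{(2d+4)/(d+4)}
\tag{2.8}
\]

as $n \to \infty$.

Since the limit in (2.8) does not depend on the underlying populations, we can plot it as a
function of $d$; cf. Figure 2. In fact, Corollary 3 suggests a natural correspondence between
any unweighted $k$-nearest neighbour classifier $\hat{C}_{n,k}^{\mathrm{knn}}$ and the weighted
nearest neighbour classifier which we denote by $\hat{C}_{n,\mathbf{w}_{k'}}^{\mathrm{wnn}}$,
whose weights are of the optimal form (1.1), but with $k^*$ replaced with

\[ k' = \Bigl\lfloor \Bigl\{\frac{2(d+4)}{d+2}\Bigr\}^{d/(d+4)} k \Bigr\rfloor. \]

Under the conditions of Corollary 3, we can compare $\hat{C}_{n,k}^{\mathrm{knn}}$ and
$\hat{C}_{n,\mathbf{w}_{k'}}^{\mathrm{wnn}}$, concluding that for each $\beta \in (0, 1/2)$,

\[
\frac{R_{\mathcal{R}}(\hat{C}_{n,\mathbf{w}_{k'}}^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}_{n,k}^{\mathrm{knn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}
\to \frac{1}{4^{d/(d+4)}} \Bigl(\frac{2d+4}{d+4}\Bigr)^{(2d+4)/(d+4)}
\tag{2.9}
\]

as $n \to \infty$, uniformly for $n^\beta \leq k \leq n^{1-\beta}$. The fact that the
convergence in (2.9) is uniform for $k$ in this range means that the ratio on the left-hand side
of (2.9) has the same limit if we replace $k$ by an estimator $\hat{k}$ constructed from the
training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, provided that $\hat{k}$ lies in this range with
probability tending to 1.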

Finally in this section, we note that the theory presented above can be extended in a natural
way to multicategory classification problems, where the class labels take values in the set
$\{1, \ldots, K\}$. Writing $\eta_r(x) = \mathbb{P}(Y = r \mid X = x)$, let

\[ \mathcal{S}_{r_1,r_2} = \Bigl\{x \in \mathcal{R} : \operatorname*{argmax}_{r \in \{1, \ldots, K\}} \eta_r(x) = \{r_1, r_2\}\Bigr\} \]

for distinct indices $r_1, r_2 \in \{1, \ldots, K\}$. In addition to (A.1) and the obvious
analogues of the conditions (A.2), (A.3) and (A.4), we require:

(A.5) For each $(r_1, r_2) \neq (r_3, r_4)$, the submanifolds $\mathcal{S}_{r_1,r_2}$ and
$\mathcal{S}_{r_3,r_4}$ of $\mathbb{R}^d$ are transversal.

Condition (A.5) ensures that
$\mathcal{S}_{r_1,r_2} \cap \mathcal{S}_{r_3,r_4} \cap (\mathcal{R} \setminus \partial\mathcal{R})$
is either empty or a $(d-2)$-dimensional submanifold of $\mathbb{R}^d$ (Guillemin and Pollack,
1974, p. 30). Under these conditions, the conclusion of Theorem 1 holds, provided that the
constants $B_1$ and $B_2$ are replaced with $\tilde{B}_1 = \sum_{r_1 \neq r_2} B_{1,r_1,r_2}$ and
$\tilde{B}_2 = \sum_{r_1 \neq r_2} B_{2,r_1,r_2}$ respectively, where each term $B_{1,r_1,r_2}$
and $B_{2,r_1,r_2}$ is an integral over $\mathcal{S}_{r_1,r_2}$. Apart from the obvious
notational changes involved in converting $B_1$ and $B_2$ to $B_{1,r_1,r_2}$ and
$B_{2,r_1,r_2}$, the only other change required is to replace the constant factor 1/4 in the
definition of $B_1$ with $\eta_{r_1,r_2}(x_0)\{1 - \eta_{r_1,r_2}(x_0)\}$, where
$\eta_{r_1,r_2}(x_0)$ denotes the common value that $\eta_{r_1}$ and $\eta_{r_2}$ take at
$x_0 \in \mathcal{S}_{r_1,r_2}$. This change accounts for the fact that $\eta_{r_1,r_2}(x_0)$
is not necessarily equal to 1/2 on $\mathcal{S}_{r_1,r_2}$.

It follows (provided also that $\tilde{B}_2 > 0$) that the asymptotically optimal weights are
still of the form (1.1), but with the ratio $B_1/B_2$ in the expression for $k^*$ in (2.4)
replaced with $\tilde{B}_1/\tilde{B}_2$. Moreover, the conclusion of Corollary 3 and the
subsequent discussion also remain true.



3 The bagged nearest neighbour classifier 

Traditionally, the bagged nearest neighbour classifier is obtained by applying the 1-nearest 
neighbour classifier to many resamples from the training data. The final classification is made 
by a majority vote on the classifications obtained from the resamples. In the most common 
version of bagging where the resamples are drawn with replacement, and the resample size is the 
same as the original sample size, bagging the nearest neighbour classifier gives no improvement
over the 1-nearest neighbour classifier (Hall and Samworth, 2005). This is because the nearest
neighbour occurs in more than half (in fact, roughly a proportion $1 - 1/e$) of the resamples.
Nevertheless, if a smaller resample size is used, then substantial improvements over the nearest
neighbour classifier are possible, as has been verified empirically by Martínez-Muñoz and Suárez
(2010). In fact, if the resample size is $m$, then the 'infinite simulation' versions of the
bagged nearest neighbour classifier in the with- and without-replacement resampling cases are
weighted nearest neighbour classifiers with respective weights

\[ w_{ni}^{\mathrm{b,with}} = \Bigl(1 - \frac{i-1}{n}\Bigr)^m - \Bigl(1 - \frac{i}{n}\Bigr)^m, \qquad i = 1, \ldots, n, \tag{3.1} \]

and

\[
w_{ni}^{\mathrm{b,w/o}} =
\begin{cases}
\dbinom{n-i}{m-1} \Big/ \dbinom{n}{m} & \text{for } i = 1, \ldots, n - m + 1 \\
0 & \text{for } i = n - m + 2, \ldots, n.
\end{cases}
\tag{3.2}
\]

Of course, the observations above render the resampling redundant, and we regard the weighted
nearest neighbour classifiers with the weights above as defining the two versions of the bagged
nearest neighbour classifier. It is convenient to let $q = m/n$ denote the resampling fraction.
Intuitively, for large $n$, both versions of the bagged nearest neighbour classifier behave like
the weighted nearest neighbour classifier with weights $(w_{ni}^{\mathrm{Geo}})_{i=1}^n$ which
place a Geometric($q$) distribution (conditioned on being in the set $\{1, \ldots, n\}$) on the
weights:

\[ w_{ni}^{\mathrm{Geo}} = \frac{q(1-q)^{i-1}}{1 - (1-q)^n}, \qquad i = 1, \ldots, n. \tag{3.3} \]
The reason for this is that, in order for the $i$th nearest neighbour of the training data to
be the nearest neighbour of the resample, the nearest $i - 1$ neighbours must not appear in
the resample, while the $i$th nearest neighbour must appear, and these events are almost
independent when $n$ is large; see Hall and Samworth (2005). Naturally, the parameter $q$ plays
a crucial role in the performance of the bagged nearest neighbour classifier, and for small
$\beta > 0$, the three vectors of weights given in (3.1), (3.2) and (3.3) belong to
$W_{n,\beta}$ for all large $n$ if
$\max(\tfrac{1}{2} n^{-(1-\beta d/4)}, n^{-(1-\beta)}) \leq q \leq 3n^{-\beta}$. In the following
corollary of Theorem 1, we write $\hat{C}_{n,q}^{\mathrm{bnn}}$ to denote either of the bagged
nearest neighbour classifiers with weights (3.1), (3.2) or their approximation with weights
(3.3).
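
The three weight vectors can be compared numerically. In the sketch below the with-replacement weight is the probability that none of the $i - 1$ nearest points is drawn in $m$ draws but the $i$th is, namely $(1 - (i-1)/n)^m - (1 - i/n)^m$, and the without-replacement weight is $\binom{n-i}{m-1}/\binom{n}{m}$; both follow from the resampling argument in the text. Function names are illustrative.

```python
import math
import numpy as np

def bagged_weights_with_replacement(n, m):
    """P(nearest neighbour of a size-m resample, drawn with replacement, is the i-th)."""
    i = np.arange(1, n + 1, dtype=float)
    return (1 - (i - 1) / n) ** m - (1 - i / n) ** m

def bagged_weights_without_replacement(n, m):
    """Same probability when the m resample points are drawn without replacement."""
    total = math.comb(n, m)
    w = np.zeros(n)
    for i in range(1, n - m + 2):
        # The i-th point is in the resample and the i-1 nearer points are not.
        w[i - 1] = math.comb(n - i, m - 1) / total
    return w

def geometric_weights(n, q):
    """Geometric(q) distribution conditioned on {1, ..., n}."""
    i = np.arange(1, n + 1, dtype=float)
    return q * (1 - q) ** (i - 1) / (1 - (1 - q) ** n)
```

With, say, $n = 500$ and $m = 25$ (so $q = 0.05$), all three vectors sum to one and agree closely, which illustrates why the geometric approximation is adequate when $n$ is large.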

Corollary 4. Assume (A.1)-(A.4). For every $\beta \in (0, 1/2)$,

\[ R_{\mathcal{R}}(\hat{C}_{n,q}^{\mathrm{bnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}}) = \gamma_n(q)\{1 + o(1)\}, \]

uniformly for $n^{-(1-\beta)} \leq q \leq n^{-\beta}$, where

\[ \gamma_n(q) = \frac{B_1}{2} q + \frac{B_2 \Gamma(2 + 2/d)^2}{n^{4/d} q^{4/d}}. \]

This result is somewhat related to Corollary 10 of Biau, Cérou and Guyader (2010). In that
paper, the authors study the bagged nearest neighbour estimate $\hat{\eta}_n$ of the regression
function $\eta$. They prove in particular that under regularity conditions (including a
Lipschitz assumption on $\eta$) and for a suitable choice of resample size,

\[ \mathbb{E}[\{\hat{\eta}_n(X) - \eta(X)\}^2] = O(n^{-2/(d+2)}) \]

for $d \geq 3$. It is known, e.g. Ibragimov and Khasminskii (1980, 1981, 1982), that this is the
minimax optimal rate for their problem.

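Corollary 4 may also be applied to deduce that the asymptotically optimal choice of $q$ in all
three cases is

\[
q^{\mathrm{opt}} = \frac{2^{3d/(d+4)} \Gamma(2 + 2/d)^{2d/(d+4)}}{d^{d/(d+4)}} \Bigl(\frac{B_2}{B_1}\Bigr)^{d/(d+4)} n^{-4/(d+4)}.
\]

Thus, in an analogous fashion to Section 2, we can consider the performance of
$\hat{C}_{n,q^{\mathrm{opt}}}^{\mathrm{bnn}}$ relative to that of
$\hat{C}_{n,k^{\mathrm{opt}}}^{\mathrm{knn}}$.

Corollary 5. Assume (A.1)-(A.4) and assume also that $B_2 > 0$. Then

\[
\frac{R_{\mathcal{R}}(\hat{C}_{n,q^{\mathrm{opt}}}^{\mathrm{bnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}_{n,k^{\mathrm{opt}}}^{\mathrm{knn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}
\to \frac{\Gamma(2 + 2/d)^{2d/(d+4)}}{2^{4/(d+4)}}
\tag{3.4}
\]

as $n \to \infty$.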



The limiting ratio in (3.4) is plotted as a function of $d$ in Figure 3. The ratio is about
1.18 when $d = 1$, showing that the bagged nearest neighbour classifier has asymptotically
worse performance than the $k$-nearest neighbour classifier in this case. The ratio is equal
to 1 when $d = 2$, and is less than 1 for $d \geq 3$. The facts that the asymptotically optimal
weights decay as illustrated in Figure 1 and that the bagged nearest neighbour weights decay
approximately geometrically explain why the bagged nearest neighbour classifier has almost
optimal performance among weighted nearest neighbour classifiers when $d$ is large.
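
The values quoted for small $d$ can be reproduced from a closed form for the limiting ratio. The expression used below, $\Gamma(2 + 2/d)^{2d/(d+4)} / 2^{4/(d+4)}$, is our reading of the limit, obtained by comparing the minimised leading-order regrets of the bagged and unweighted classifiers; treat it as an assumption of the sketch.

```python
import math

def bagged_regret_ratio(d):
    """Limiting regret ratio: optimally bagged nearest neighbour over optimal unweighted k-NN."""
    return math.gamma(2 + 2 / d) ** (2 * d / (d + 4)) / 2 ** (4 / (d + 4))

# Ratio above 1 means bagging is asymptotically worse than unweighted k-NN.
for d in (1, 2, 3, 10):
    print(d, round(bagged_regret_ratio(d), 4))
```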

Similar to the discussion following Corollary 3, based on the expressions for
$k^{\mathrm{opt}}$ and $q^{\mathrm{opt}}$, there is a natural correspondence between the
unweighted $k$-nearest neighbour classifier $\hat{C}_{n,\hat{k}}^{\mathrm{knn}}$ with data
driven $\hat{k}$, and the bagged nearest neighbour classifier
$\hat{C}_{n,\hat{q}}^{\mathrm{bnn}}$, where

\[ \hat{q} = \frac{2^{d/(d+4)} \Gamma(2 + 2/d)^{2d/(d+4)}}{\hat{k}}. \]

The same limit (3.4) holds for the regret ratio of these classifiers, again provided there exists
$\beta \in (0, 1/2)$ such that $\mathbb{P}(n^\beta \leq \hat{k} \leq n^{1-\beta}) \to 1$.




Figure 3: Asymptotic ratio of the regret of the bagged nearest neighbour classifier (dashed) to
that of the $k$-nearest neighbour classifier, as a function of the dimension of the feature
vectors. The asymptotic regret ratio for the optimally weighted nearest neighbour classifier
compared with the $k$-nearest neighbour classifier is shown as a solid line for comparison.
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4 Faster rates of convergence 



If we allow negative weights, it is possible to choose weights satisfying $\sum_{i=1}^n \alpha_i w_{ni} = 0$. This
means that we can eradicate the dominant squared bias term in the asymptotic expansion
of Theorem 1. It follows that, subject to additional smoothness conditions, we can achieve
faster rates of convergence with weighted nearest neighbour classifiers, as we now describe. The
appropriate variant of condition (A.2), which we denote by (A.2)(r), is as follows:

(A.2)(r) The set $\mathcal{S} = \{x \in \mathcal{R} : \eta(x) = 1/2\}$ is non-empty. There exists an open subset $U_0$
of $\mathbb{R}^d$ that contains $\mathcal{S}$ and such that the following properties hold: firstly, $|\eta(x) - 1/2|$ is
bounded away from zero for $x \in U \setminus U_0$, where $U$ is an open set containing $\mathcal{R}$; secondly,
the restrictions of $P_1$ and $P_2$ to $U_0$ are absolutely continuous with respect to Lebesgue
measure, with $2r$-times continuously differentiable Radon-Nikodym derivatives $f_1$ and $f_2$
respectively.

Thus condition (A.2)(1) is identical to (A.2). Note that we are still in the setting of a
margin condition with power parameter equal to 1. For non-negative integers $m_1$ and $m_2$ and
$m_3 \in \{0, \tfrac12, 1, \tfrac32, 2, \ldots\}$, let

$$g_1(m_1, m_2) = \int_0^{\pi} \cos^{2m_1}(x)\,\sin^{2m_2}(x)\,dx = \frac{(2m_1)!\,(2m_2)!\,\pi}{4^{m_1+m_2}\,m_1!\,m_2!\,(m_1+m_2)!},$$

$$g_2(m_1, m_2) = \int_0^{\pi} \cos^{2m_1}(x)\,\sin^{2m_2+1}(x)\,dx = \frac{4^{m_2+1}\,(2m_1)!\,m_2!\,(m_1+m_2+1)!}{m_1!\,\{2(m_1+m_2+1)\}!},$$

$$g_3(m_1, m_3) = g_1(m_1, m_3)\,1_{\{m_3 \in \mathbb{Z}\}} + g_2(m_1, m_3 - \tfrac12)\,1_{\{m_3 \notin \mathbb{Z}\}}.$$
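
The closed forms for these trigonometric integrals are easy to sanity-check numerically. The sketch below assumes the integrals are taken over $[0, \pi]$ (the range for which the stated factorial expressions hold) and compares them with a composite Simpson rule; all function names are hypothetical.

```python
import math


def g1_closed(m1: int, m2: int) -> float:
    """Closed form assumed for the integral of cos^(2 m1) sin^(2 m2) over [0, pi]."""
    num = math.factorial(2 * m1) * math.factorial(2 * m2) * math.pi
    den = (4 ** (m1 + m2) * math.factorial(m1) * math.factorial(m2)
           * math.factorial(m1 + m2))
    return num / den


def g2_closed(m1: int, m2: int) -> float:
    """Closed form assumed for the integral of cos^(2 m1) sin^(2 m2 + 1) over [0, pi]."""
    num = (4 ** (m2 + 1) * math.factorial(2 * m1) * math.factorial(m2)
           * math.factorial(m1 + m2 + 1))
    den = math.factorial(m1) * math.factorial(2 * (m1 + m2 + 1))
    return num / den


def simpson(f, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def g1_numeric(m1: int, m2: int) -> float:
    return simpson(lambda x: math.cos(x) ** (2 * m1) * math.sin(x) ** (2 * m2),
                   0.0, math.pi)


def g2_numeric(m1: int, m2: int) -> float:
    return simpson(lambda x: math.cos(x) ** (2 * m1) * math.sin(x) ** (2 * m2 + 1),
                   0.0, math.pi)
```

For example, `g2_closed(1, 1)` evaluates to $4/15$, which is the value of $\int_0^\pi \cos^2(x)\sin^3(x)\,dx$.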
Let $s = (s_1, \ldots, s_d)^T$ be a multi-index (i.e. a $d$-tuple of non-negative integers). We write
$|s| = s_1 + \ldots + s_d$, and for $v = (v_1, \ldots, v_d)^T \in \mathbb{R}^d$, we write $v^s = v_1^{s_1} v_2^{s_2} \cdots v_d^{s_d}$. Now, for
$j = 1, \ldots, d$, let

f 2 "'"^ 

lkll<i j'=i 

where we evaluate the integral by transforming to spherical coordinates. It is convenient here to
use multi-index notation for derivatives, so we write $g_s(x) = \frac{\partial^{|s|}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}} g(x)$. As non-standard
multi-index notation, it is also convenient for $j = 1, \ldots, d$ to write $g_{s,j}(x) = \frac{\partial}{\partial x_j} g_s(x)$. Now let



= ^^;-^^j;TT^^ ^D,,{r^,(x)/2,,(x) + ^ry,,(x)/2.(x)}, 

so that $a^{(1)}(x) = a(x)$. Let

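$$B_2^{(r)} = \int_{\mathcal{S}} \frac{\bar{f}(x_0)}{\|\dot\eta(x_0)\|}\, a^{(r)}(x_0)^2 \, d\mathrm{Vol}^{d-1}(x_0).$$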
For $\ell \in \mathbb{N}$, define $\alpha_i^{(\ell)} = i^{1+2\ell/d} - (i-1)^{1+2\ell/d}$. We consider restrictions on the set of weight vectors
analogous to those imposed on $r$th order kernels in kernel density estimation. Specifically, we
let $W_{n,\beta,r}$ denote the set of deterministic weight vectors $\mathbf{w}_n = (w_{ni})_{i=1}^n$ satisfying

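• $\sum_{i=1}^n w_{ni} = 1$, $\; n^{2r/d} \sum_{i=1}^n \alpha_i^{(\ell)} w_{ni} \big/ n^{2\ell/d} \sum_{i=1}^n \alpha_i^{(r)} w_{ni} \le 1/\log n$ for $\ell = 1, \ldots, r-1$;

• there exists $k_2 \le \lfloor n^{1-\beta} \rfloor$ such that $n^{2r/d} \sum_{i=k_2+1}^n |w_{ni}| \big/ \sum_{i=1}^n \alpha_i^{(r)} w_{ni} \le 1/\log n$ and such
that $\sum_{i=1}^{k_2} \alpha_i^{(r)} w_{ni} \ge \beta k_2^{2r/d}$;

• $\sum_{i=1}^n |w_{ni}|^3 \big/ \bigl(\sum_{i=1}^n w_{ni}^2\bigr)^{3/2} \le 1/\log n$.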

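Finally, we are in a position to state the analogue of Theorem 1 for weight vectors in $W_{n,\beta,r}$.

Theorem 6. Assume (A.1), (A.2)(r), (A.3) and (A.4). Then for each $\beta > 0$,

$$R_{\mathcal{R}}(\hat{C}_n^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}}) = \gamma_n^{(r)}(\mathbf{w}_n)\{1 + o(1)\}, \tag{4.1}$$

as $n \to \infty$, uniformly for $\mathbf{w}_n \in W_{n,\beta,r}$, where

$$\gamma_n^{(r)}(\mathbf{w}_n) = B_1 \sum_{i=1}^n w_{ni}^2 + B_2^{(r)} \Bigl(\sum_{i=1}^n \frac{\alpha_i^{(r)} w_{ni}}{n^{2r/d}}\Bigr)^2. \tag{4.2}$$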

A consequence of Theorem 6 is that we can construct weighted nearest neighbour classifiers
which, under conditions (A.1), (A.2)(r), (A.3) and (A.4), and provided $B_2^{(r)} > 0$, achieve the
rate of convergence $O(n^{-4r/(4r+d)})$ for the regret. To illustrate this, set $k^{*(r)} = \lfloor B^{*(r)} n^{4r/(4r+d)} \rfloor$,
and in order to satisfy the restrictions on the allowable weights, consider weight vectors with
$w_{ni} = 0$ for $i = k^{*(r)} + 1, \ldots, n$. Then, by mimicking the proof of Theorem 2 and seeking
to minimise (4.2) subject to the constraints $\sum_{i=1}^n w_{ni} = 1$ and $\sum_{i=1}^n \alpha_i^{(\ell)} w_{ni} = 0$ for
$\ell = 1, \ldots, r-1$, we obtain minimising weights of the form

$$w_{ni}^{*(r)} = \begin{cases} \dfrac{1}{k^{*(r)}}\bigl(b_0 + b_1 \alpha_i^{(1)} + \ldots + b_r \alpha_i^{(r)}\bigr) & \text{for } i = 1, \ldots, k^{*(r)}, \\[1ex] 0 & \text{for } i = k^{*(r)} + 1, \ldots, n. \end{cases} \tag{4.3}$$

The equations $\sum_{i=1}^n w_{ni} = 1$ and $\sum_{i=1}^n \alpha_i^{(\ell)} w_{ni} = 0$ for $\ell = 1, \ldots, r-1$ for weight vectors of
the form (4.3) yield $r$ linear equations in the $r + 1$ unknowns $b_0, b_1, \ldots, b_r$. Although these
equations can be solved directly in terms of $b_0$ say, simpler expressions are obtained by solving
asymptotic approximations to these equations. In particular, since it is an elementary fact that
for non-negative integers $\ell_1$ and $\ell_2$,

$$\sum_{i=1}^{k} \alpha_i^{(\ell_1)} \alpha_i^{(\ell_2)} = \frac{(d + 2\ell_1)(d + 2\ell_2)}{d\,(d + 2\ell_1 + 2\ell_2)}\, k^{1 + 2(\ell_1+\ell_2)/d}\,\{1 + o(1)\}$$

as $k \to \infty$, we can just deal with the dominant terms. As examples, when $r = 1$, we find

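$$b_1 = \frac{1}{(k^{*(1)})^{2/d}}\,(1 - b_0),$$

and when $r = 2$, we should take

$$b_1 = \frac{1}{(k^{*(2)})^{2/d}} \Bigl\{ \frac{(d+4)^2}{4} - \frac{2(d+4)}{d+2}\, b_0 \Bigr\} \quad \text{and} \quad b_2 = \frac{1}{(k^{*(2)})^{4/d}} \bigl\{ 1 - b_0 - (k^{*(2)})^{2/d}\, b_1 \bigr\}.$$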
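
To make the $r = 2$ construction concrete, here is a minimal sketch that builds weights of the form $k^{-1}(b_0 + b_1\alpha_i^{(1)} + b_2\alpha_i^{(2)})$ on the first $k$ neighbours, with $\alpha_i^{(\ell)} = i^{1+2\ell/d} - (i-1)^{1+2\ell/d}$ and the asymptotic choices $b_1 = k^{-2/d}\{(d+4)^2/4 - 2(d+4)b_0/(d+2)\}$ and $b_2 = k^{-4/d}(1 - b_0 - k^{2/d} b_1)$. The values of `k`, `d` and `b0` are illustrative assumptions, and the function names are hypothetical. Because the coefficients solve asymptotic approximations, the weights sum to one exactly but the first-moment constraint holds only approximately.

```python
import math


def alpha(i: int, ell: int, d: int) -> float:
    """alpha_i^(ell) = i^(1 + 2 ell / d) - (i - 1)^(1 + 2 ell / d)."""
    p = 1 + 2 * ell / d
    return i ** p - (i - 1) ** p


def weights_r2(k: int, d: int, b0: float) -> list:
    """Weights on the first k neighbours for r = 2, using the asymptotic b1, b2."""
    kd = k ** (2 / d)
    b1 = ((d + 4) ** 2 / 4 - 2 * (d + 4) * b0 / (d + 2)) / kd
    b2 = (1 - b0 - kd * b1) / kd ** 2
    return [(b0 + b1 * alpha(i, 1, d) + b2 * alpha(i, 2, d)) / k
            for i in range(1, k + 1)]


def first_moment(w: list, d: int) -> float:
    """sum_i alpha_i^(1) w_i, which the r = 2 construction drives towards zero."""
    return math.fsum(alpha(i + 1, 1, d) * wi for i, wi in enumerate(w))


if __name__ == "__main__":
    k, d, b0 = 1000, 2, 0.5  # illustrative values only
    w = weights_r2(k, d, b0)
    print("sum of weights:", math.fsum(w))
    print("first moment / k^(2/d):", first_moment(w, d) / k ** (2 / d))
    print("smallest weight:", min(w))
```

Some of the resulting weights are negative, which is exactly the feature that allows the leading bias term to be cancelled.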

Under the conditions of Theorem 6, and provided $B_2^{(r)} > 0$, these weighted nearest neighbour
classifiers achieve the $O(n^{-4r/(4r+d)})$ convergence rate. The choice of $b_0$ involves a trade-off
between the desire to keep the remaining squared bias term $B_2^{(r)} \bigl(\sum_{i=1}^n \alpha_i^{(r)} w_{ni} / n^{2r/d}\bigr)^2$ small, and
the need for it to be large enough to remain the dominant bias term. This reflects the fact that
the asymptotic results of this section should be applied with some caution. Besides the discomfort
many practitioners might feel in using negative weights, one would anticipate that rather
large sample sizes would be needed for the leading terms in the asymptotic expansion (4.1) to
dominate the error terms. This is also the reason why we do not pursue here methods such as
Lepski's method (Lepskii, 1991) that adapt to an unknown smoothness level around $\mathcal{S}$.



5 Appendix

Proof of Theorem 1

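The proof is rather lengthy, so we briefly outline the main ideas here. Write $P^\circ = \pi P_1 - (1-\pi) P_2$
and observe that

$$\begin{aligned} R_{\mathcal{R}}(\hat{C}_n^{\mathrm{wnn}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}}) &= \int_{\mathcal{R}} \pi \bigl[ P\{\hat{C}_n^{\mathrm{wnn}}(x) = 2\} - 1_{\{C^{\mathrm{Bayes}}(x) = 2\}} \bigr] \, dP_1(x) \\ &\quad + \int_{\mathcal{R}} (1 - \pi) \bigl[ P\{\hat{C}_n^{\mathrm{wnn}}(x) = 1\} - 1_{\{C^{\mathrm{Bayes}}(x) = 1\}} \bigr] \, dP_2(x) \\ &= \int_{\mathcal{R}} \Bigl\{ P\Bigl( \sum_{i=1}^n w_{ni} 1_{\{Y_{(i)} = 1\}} < \tfrac12 \Bigr) - 1_{\{\eta(x) < 1/2\}} \Bigr\} \, dP^\circ(x). \end{aligned} \tag{5.1}$$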

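For $\epsilon > 0$, let

$$\mathcal{S}^{\epsilon\epsilon} = \{x \in \mathbb{R}^d : \eta(x) = 1/2 \text{ and } \mathrm{dist}(x, \mathcal{S}) < \epsilon\}, \tag{5.2}$$

where $\mathrm{dist}(x, \mathcal{S}) = \inf_{x_0 \in \mathcal{S}} \|x - x_0\|$. Moreover, let

$$\mathcal{S}^{\epsilon} = \Bigl\{ x_0 + t \frac{\dot\eta(x_0)}{\|\dot\eta(x_0)\|} : x_0 \in \mathcal{S}^{\epsilon\epsilon},\ |t| < \epsilon \Bigr\}.$$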

The dominant contribution to the integral in (5.1) comes from $\mathcal{R} \cap \mathcal{S}^{\epsilon_n}$, where $\epsilon_n = n^{-\beta/(4d)}$.
Since the unit vector $\dot\eta(x_0)/\|\dot\eta(x_0)\|$ is orthogonal to the tangent space of $\mathcal{S}$ at $x_0$, we can
decompose the integral over $\mathcal{R} \cap \mathcal{S}^{\epsilon_n}$ as an integral along $\mathcal{S}$ and an integral in the perpendicular
direction. We then apply a normal approximation to the integrand to deduce the result. This
normal approximation requires asymptotic expansions to the mean and variance of the sum of
independent random variables in (5.1), and these are developed in Step 1 and Step 2 below
respectively. In order to retain the flow of the main argument, we concentrate on the dominant
terms in the first five steps of the argument, simply labelling the many remainder terms as
$R_1, R_2, \ldots$. The sizes of these remainder terms are controlled in Step 6, where we also present
an additional side calculation.

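Step 1: Let $S_n(x) = \sum_{i=1}^n w_{ni} 1_{\{Y_{(i)} = 1\}}$, let $\mu_n(x) = E\{S_n(x)\}$, let $\epsilon_n = n^{-\beta/(4d)}$ and write
$t_n = n^{-2/d} \sum_{i=1}^n \alpha_i w_{ni}$. We show that

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} |\mu_n(x) - \eta(x) - a(x) t_n| = o(t_n),$$

uniformly for $\mathbf{w}_n = (w_{ni})_{i=1}^n \in W_{n,\beta}$, where $a$ is given in (2.2).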
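By a Taylor expansion,

$$\begin{aligned} \mu_n(x) &= \sum_{i=1}^n w_{ni} E\,\eta(X_{(i)}) \\ &= \eta(x) + \sum_{i=1}^{k_2} w_{ni} E\{(X_{(i)} - x)^T \dot\eta(x)\} + \frac12 \sum_{i=1}^{k_2} w_{ni} E\{(X_{(i)} - x)^T \ddot\eta(x) (X_{(i)} - x)\} + R_1, \end{aligned} \tag{5.3}$$

where we show in Step 6 that

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} |R_1| = o(t_n), \tag{5.4}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Writing $p_t = p_t(x) = P(\|X - x\| \le t)$, we also show in Step 6
that for $x \in \mathcal{S}^{\epsilon_n}$ and $i \le k_2$, the restriction of the distribution of $X_{(i)} - x$ to a sufficiently
small ball about the origin is absolutely continuous with respect to Lebesgue measure, with
Radon-Nikodym derivative given at $u = (u_1, \ldots, u_d)^T$ by

$$f_{(i)}(u) = n \bar{f}(x + u) \binom{n-1}{i-1} p_{\|u\|}^{i-1} (1 - p_{\|u\|})^{n-i} = n \bar{f}(x + u)\, P\{\mathrm{Bin}(n-1, p_{\|u\|}) = i - 1\}. \tag{5.5}$$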

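Let $\delta_n = (k_2/n)^{1/(2d)}$. By examining the argument leading to (5.29), we see that we can replace
$\delta$ there with $\delta_n$, to conclude that for all $M > 0$,

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \sup_{1 \le i \le k_2} E\{\|X_{(i)} - x\|^2 1_{\{\|X_{(i)} - x\| > \delta_n\}}\} = O(n^{-M}).$$

It follows that

$$E\{(X_{(i)} - x)^T \dot\eta(x)\} = \int_{\|u\| \le \delta_n} \dot\eta(x)^T u \, n\{\bar{f}(x + u) - \bar{f}(x)\}\, P\{\mathrm{Bin}(n-1, p_{\|u\|}) = i - 1\} \, du + O(n^{-M}), \tag{5.6}$$

uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $1 \le i \le k_2$. Similarly,

$$E\{(X_{(i)} - x)^T \ddot\eta(x) (X_{(i)} - x)\} = \int_{\|u\| \le \delta_n} u^T \ddot\eta(x) u \, n \bar{f}(x + u)\, P\{\mathrm{Bin}(n-1, p_{\|u\|}) = i - 1\} \, du + O(n^{-M}), \tag{5.7}$$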

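uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $1 \le i \le k_2$. Let $k_1 = \lfloor n^{\beta/4} \rfloor$, and let $\Delta w_{ni} = w_{ni} - w_{n,i+1}$ with
$w_{n,n+1} = 0$ (where we introduce the comma here for clarity). By a Taylor expansion, we have

$$\begin{aligned} &\sum_{i=1}^{k_2} w_{ni} \int_{\|u\| \le \delta_n} \bigl[ \dot\eta(x)^T u \, n\{\bar{f}(x + u) - \bar{f}(x)\} + \tfrac12 u^T \ddot\eta(x) u \, n \bar{f}(x + u) \bigr] P\{\mathrm{Bin}(n-1, p_{\|u\|}) = i - 1\} \, du \\ &\quad = \{1 + o(1)\} \sum_{i=k_1}^{k_2} n \Delta w_{ni} \sum_{j=1}^d \int_{\|u\| \le \delta_n} \bigl\{ \eta_j(x) u_j^2 \bar{f}_j(x) + \tfrac12 \eta_{jj}(x) u_j^2 \bar{f}(x) \bigr\} P\{\mathrm{Bin}(n-1, p_{\|u\|}) < i\} \, du, \end{aligned} \tag{5.8}$$

uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $\mathbf{w}_n \in W_{n,\beta}$. Now, $P\{\mathrm{Bin}(n-1, p_{\|u\|}) < i\}$ is decreasing in $\|u\|$ and is
close to 1 when $\|u\|$ is small and close to zero when $\|u\|$ is large. To analyse this more precisely,
note that $p_{\|u\|} = \bar{f}(x) a_d \|u\|^d \{1 + O(\|u\|^2)\}$ as $u \to 0$, uniformly for $x \in \mathcal{S}^{\epsilon_n}$, so it is convenient
to let $b_n = \bigl( (n-1) a_d \bar{f}(x) / i \bigr)^{1/d}$ and set $v = b_n u$. Then there exists $n_0$ such that for $n \ge n_0$, we have
for all $x \in \mathcal{S}^{\epsilon_n}$, all $\|v\|^d \in (0, 1 - 2/\log n]$ and all $k_1 \le i \le k_2$ that

$$i - (n-1)\, p_{\|v\|/b_n} \ge \frac{i}{\log n}.$$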

Thus by Bernstein's inequality (Shorack and Wellner, 1986, p. 440), for each $M > 0$ and for
$n \ge n_0$,

$$\sup_{\|v\|^d \in (0, 1 - 2/\log n]} \ \sup_{k_1 \le i \le k_2} \bigl[ 1 - P\{\mathrm{Bin}(n-1, p_{\|v\|/b_n}) < i\} \bigr] \le \exp\Bigl( -\frac{k_1}{3 \log^2 n} \Bigr) = O(n^{-M}). \tag{5.9}$$

Similarly, for $n \ge n_0$,

$$\sup_{\|v\|^d \in [1 + 2/\log n,\, b_n^d \delta_n^d]} \ \sup_{k_1 \le i \le k_2} P\{\mathrm{Bin}(n-1, p_{\|v\|/b_n}) < i\} \le \exp\Bigl( -\frac{k_1}{3 \log^2 n} \Bigr) = O(n^{-M}). \tag{5.10}$$

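We deduce from (5.6), (5.7), (5.8), (5.9) and (5.10) that

$$\begin{aligned} &\sum_{i=1}^{k_2} w_{ni} E\{(X_{(i)} - x)^T \dot\eta(x)\} + \frac12 \sum_{i=1}^{k_2} w_{ni} E\{(X_{(i)} - x)^T \ddot\eta(x) (X_{(i)} - x)\} \\ &\quad = \{1 + o(1)\} \sum_{i=k_1}^{k_2} \frac{n \Delta w_{ni}}{b_n^{d+2}} \sum_{j=1}^d \bigl\{ \eta_j(x) \bar{f}_j(x) + \tfrac12 \eta_{jj}(x) \bar{f}(x) \bigr\} \int_{v : \|v\| \le 1} v_j^2 \, dv \\ &\quad = a(x) t_n + o(t_n), \end{aligned} \tag{5.11}$$

uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $\mathbf{w}_n \in W_{n,\beta}$. Combining (5.3), (5.4) and (5.11), this completes Step 1.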
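Step 2: Let $\sigma_n^2(x) = \mathrm{Var}\{S_n(x)\}$ and let $s_n^2 = \sum_{i=1}^n w_{ni}^2$. We claim that

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \Bigl| \sigma_n^2(x) - \frac14 s_n^2 \Bigr| = o(s_n^2),$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. To see this, note that

$$\sigma_n^2(x) = \sum_{i=1}^n w_{ni}^2 \mathrm{Var}(1_{\{Y_{(i)} = 1\}}) = \sum_{i=1}^n w_{ni}^2 \bigl[ E\eta(X_{(i)}) - \{E\eta(X_{(i)})\}^2 \bigr].$$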

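But by a simplified version of the argument in Step 1, we have

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \sup_{1 \le i \le k_2} |E\eta(X_{(i)}) - \eta(x)| \to 0.$$

It follows that

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \Bigl| \sum_{i=1}^n w_{ni}^2 E\eta(X_{(i)}) - \frac12 s_n^2 \Bigr| \le \sup_{x \in \mathcal{S}^{\epsilon_n}} \sum_{i=1}^{k_2} w_{ni}^2 |E\eta(X_{(i)}) - \eta(x)| + \sum_{i=k_2+1}^n w_{ni}^2 + s_n^2 \sup_{x \in \mathcal{S}^{\epsilon_n}} |\eta(x) - 1/2| = o(s_n^2),$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Similarly,

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \Bigl| \sum_{i=1}^n w_{ni}^2 \{E\eta(X_{(i)})\}^2 - \frac14 s_n^2 \Bigr| \le \sup_{x \in \mathcal{S}^{\epsilon_n}} \sum_{i=1}^{k_2} w_{ni}^2 |E\eta(X_{(i)}) - \eta(x)|\,|E\eta(X_{(i)}) + \eta(x)| + 2 \sum_{i=k_2+1}^n w_{ni}^2 + s_n^2 \sup_{x \in \mathcal{S}^{\epsilon_n}} |\eta(x)^2 - 1/4| = o(s_n^2),$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. This completes Step 2.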

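Step 3: For $x_0 \in \mathcal{S}$ and $t \in \mathbb{R}$, we write $x_0^t = x_0 + t \dot\eta(x_0)/\|\dot\eta(x_0)\|$ for brevity. Moreover, we
write $\psi = \pi f_1 - (1 - \pi) f_2$ for the Radon-Nikodym derivative with respect to Lebesgue measure
of the restriction of $P^\circ$ to $\mathcal{S}^{\epsilon_n}$ for large $n$. We show that

$$\int_{\mathcal{R} \cap \mathcal{S}^{\epsilon_n}} \bigl[ P\{S_n(x) < 1/2\} - 1_{\{\eta(x) < 1/2\}} \bigr] \, dP^\circ(x) = \int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} \psi(x_0^t) \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) \{1 + o(1)\}, \tag{5.12}$$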
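uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Recalling the definition of $\mathcal{S}^{\epsilon_n \epsilon_n}$ in (5.2), note that for large $n$, the
map

$$\Bigl( x_0,\ t \frac{\dot\eta(x_0)}{\|\dot\eta(x_0)\|} \Bigr) \mapsto x_0^t$$

is a diffeomorphism from $\{(x_0, t\dot\eta(x_0)/\|\dot\eta(x_0)\|) : x_0 \in \mathcal{S}^{\epsilon_n \epsilon_n}, |t| < \epsilon_n\}$ onto $\mathcal{S}^{\epsilon_n}$ (Gray, 2004,
pp. 32-33). Observe that

$$\{x \in \mathbb{R}^d : \mathrm{dist}(x, \mathcal{S}) < \epsilon_n\} \subseteq \mathcal{S}^{\epsilon_n} \subseteq \{x \in \mathbb{R}^d : \mathrm{dist}(x, \mathcal{S}) < 2\epsilon_n\}. \tag{5.13}$$

Moreover, for large $n$ and $|t| < \epsilon_n$, we have $\mathrm{sgn}\{\eta(x_0^t) - 1/2\} = \mathrm{sgn}\{\psi(x_0^t)\} = \mathrm{sgn}(t)$. The
pullback of the $d$-form $dx$ is given at $(x_0, t\dot\eta(x_0)/\|\dot\eta(x_0)\|)$ by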



dt dVol 



d-l, 



[Xoj 



{1 + o{l)} dt dVol 



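where the error term is uniform in $(x_0, t\dot\eta(x_0)/\|\dot\eta(x_0)\|)$ for $x_0 \in \mathcal{S}^{\epsilon_n \epsilon_n}$ and $|t| < \epsilon_n$. It follows from
the theory of integration on manifolds, as described in Guillemin and Pollack (1974, p. 168)
and Gray (2004, Theorems 3.15 and 4.7) (see also Moore (1992)), that

$$\int_{\mathcal{S}^{\epsilon_n}} \bigl[ P\{S_n(x) < 1/2\} - 1_{\{\eta(x) < 1/2\}} \bigr] \, dP^\circ(x) = \int_{\mathcal{S}^{\epsilon_n \epsilon_n}} \int_{-\epsilon_n}^{\epsilon_n} \psi(x_0^t) \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) \{1 + o(1)\}, \tag{5.14}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. But $\mathcal{S}^{\epsilon_n} \setminus \mathcal{R} \subseteq \{x \in \mathbb{R}^d : \mathrm{dist}(x, \partial\mathcal{S}) < \epsilon_n\}$, and this latter set has
volume $O(\epsilon_n^2)$ by Weyl's tube formula (Gray, 2004, Theorem 4.8). Thus the integral over $\mathcal{S}^{\epsilon_n}$
in (5.14) may be replaced with an integral over $\mathcal{R} \cap \mathcal{S}^{\epsilon_n}$ and, similarly, the integral over $\mathcal{S}^{\epsilon_n \epsilon_n}$
may be replaced with an integral over $\mathcal{S}$, without changing the order of the error term in (5.14).
Thus (5.12) holds, and this completes Step 3.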

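Step 4: We now return to the main argument to bound the contribution to the risk (5.1)
from $\mathcal{R} \setminus \mathcal{S}^{\epsilon_n}$. In particular, we show that

$$\sup_{\mathbf{w}_n \in W_{n,\beta}} \Bigl| \int_{\mathcal{R} \setminus \mathcal{S}^{\epsilon_n}} \bigl[ P\{S_n(x) < 1/2\} - 1_{\{\eta(x) < 1/2\}} \bigr] \, dP^\circ(x) \Bigr| = O(n^{-M}), \tag{5.15}$$

for all $M > 0$. To see this, recall that $|\eta(x) - 1/2|$ is assumed to be bounded away from zero
on the set $\mathcal{R} \setminus \mathcal{S}^{\epsilon}$ (for fixed $\epsilon > 0$), and $\|\dot\eta(x_0)\|$ is bounded away from zero for $x_0 \in \mathcal{S}$. Hence,
by (5.13) in Step 3, there exists $c_1 > 0$ such that, for sufficiently small $\epsilon > 0$,

$$\inf_{x \in \mathcal{R} \setminus \mathcal{S}^{\epsilon}} |\eta(x) - 1/2| \ge c_1 \epsilon. \tag{5.16}$$

We also claim that $\mu_n(x) = E\{S_n(x)\}$ is similarly bounded away from 1/2 uniformly for
$x \in \mathcal{R} \setminus \mathcal{S}^{\epsilon_n}$. In fact, we have by Hoeffding's inequality that

$$P(\|X_{(k_2)} - x\| > \epsilon_n/2) = P\{\mathrm{Bin}(n, p_{\epsilon_n/2}) < k_2\} \le \exp\Bigl\{ -\frac{2}{n} (n p_{\epsilon_n/2} - k_2)^2 \Bigr\}.$$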






It follows that 
sup fin{x) - ^ 

V{x)<l/2 

< ' 



sup ^^i;„P(r(,) = ln||X(,,)-xl| <ej2)-- + e-^("f-/2-fc2)^+ ^ .^Jm} 
<f2^m{l-"-^)-l + e-^("^-/-^^)^ + n~'' < -\c,e^ (5.17) 

i=l 

for sufficiently large n. Similarly, 

1 1 

inf - - > inf V u;™P(F(i) = 1 n ||X(fc,) - a:|| < e„/2) - - 

r)(x)>l/2 ■q{x)>l/2 «=1 



> (1 - n-^) (i + ^) (1 - e-^("--/-'=^)^) - i > ici6., (5.18) 




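for large $n$.

Now we may apply Hoeffding's inequality again, this time to $S_n(x)$, to deduce that

$$\sup_{\mathbf{w}_n \in W_{n,\beta}} \sup_{x \in \mathcal{R} \setminus \mathcal{S}^{\epsilon_n}} \bigl| P\{S_n(x) < 1/2\} - 1_{\{\eta(x) < 1/2\}} \bigr| \le \sup_{\mathbf{w}_n \in W_{n,\beta}} \sup_{x \in \mathcal{R} \setminus \mathcal{S}^{\epsilon_n}} \exp\Bigl\{ -\frac{2(\mu_n(x) - 1/2)^2}{s_n^2} \Bigr\} = O(n^{-M}),$$

for each $M > 0$, using (5.17) and (5.18) and the fact that $s_n^2 \le n^{-\beta}$ for $\mathbf{w}_n \in W_{n,\beta}$. This
completes Step 4.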
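
Step 4 relies on the binomial form of Hoeffding's inequality, $P\{\mathrm{Bin}(n, p) < k\} \le \exp\{-2(np - k)^2/n\}$ for $k \le np$. The following sketch compares this bound with the exact lower tail for illustrative values of $n$, $p$ and $k$; these values and the function names are assumptions made only for the example.

```python
import math


def binomial_lower_tail(n: int, p: float, k: int) -> float:
    """Exact P{Bin(n, p) < k}, summed term by term."""
    return math.fsum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
                     for j in range(k))


def hoeffding_lower_tail_bound(n: int, p: float, k: int) -> float:
    """Hoeffding bound exp(-2 (n p - k)^2 / n), valid when k <= n p."""
    if k > n * p:
        raise ValueError("the bound is only stated here for k <= n * p")
    return math.exp(-2 * (n * p - k) ** 2 / n)


if __name__ == "__main__":
    n, p, k = 200, 0.3, 40  # illustrative values only
    print("exact tail :", binomial_lower_tail(n, p, k))
    print("Hoeffding  :", hoeffding_lower_tail_bound(n, p, k))
```

The bound is far from tight in this example, but its exponential decay in $(np - k)^2/n$ is all that the argument needs.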

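Step 5: We now show that

$$\int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} \psi(x_0^t) \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) = B_1 s_n^2 + B_2 t_n^2 + o(s_n^2 + t_n^2),$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$, where $B_1$ and $B_2$ were defined in (2.3). When combined with (5.1)
and the results of Step 3 and Step 4 (in particular, (5.12) and (5.15)), this will complete the
proof of Theorem 1.

First observe that

$$\int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} \psi(x_0^t) \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) = \int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) \{1 + o(1)\}. \tag{5.19}$$

Now, $S_n(x)$ is a sum of independent, bounded random variables, so by the non-uniform version
of the Berry-Esseen theorem, there exists $C_1 > 0$ such that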





sup sup sup 



^1/2(1 + |^|3) 
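where $\Phi$ denotes the standard normal distribution function. Thus

$$\int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \bigl[ P\{S_n(x_0^t) < 1/2\} - 1_{\{t < 0\}} \bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) = \int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \Bigl[ \Phi\Bigl( \frac{1/2 - \mu_n(x_0^t)}{\sigma_n(x_0^t)} \Bigr) - 1_{\{t < 0\}} \Bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) + R_2,$$

where we show in Step 6 that

$$|R_2| = o(s_n^2 + t_n^2), \tag{5.20}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Moreover, by a Taylor expansion and Step 1 and Step 2,

$$\int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \Bigl[ \Phi\Bigl( \frac{1/2 - \mu_n(x_0^t)}{\sigma_n(x_0^t)} \Bigr) - 1_{\{t < 0\}} \Bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) = \int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \Bigl[ \Phi\Bigl( \frac{-2t\|\dot\eta(x_0)\| - 2a(x_0) t_n}{s_n} \Bigr) - 1_{\{t < 0\}} \Bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) + R_3,$$

where we show in Step 6 that

$$|R_3| = o(s_n^2 + t_n^2), \tag{5.21}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Finally, we can make the substitution $r = t/s_n$ to conclude that

$$\begin{aligned} &\int_{\mathcal{S}} \int_{-\epsilon_n}^{\epsilon_n} t \|\dot\psi(x_0)\| \Bigl[ \Phi\Bigl( \frac{-2t\|\dot\eta(x_0)\| - 2a(x_0) t_n}{s_n} \Bigr) - 1_{\{t < 0\}} \Bigr] \, dt \, d\mathrm{Vol}^{d-1}(x_0) \\ &\quad = s_n^2 \int_{\mathcal{S}} \int_{-\infty}^{\infty} r \|\dot\psi(x_0)\| \Bigl[ \Phi\Bigl( -2r\|\dot\eta(x_0)\| - \frac{2a(x_0) t_n}{s_n} \Bigr) - 1_{\{r < 0\}} \Bigr] \, dr \, d\mathrm{Vol}^{d-1}(x_0) + R_4 \\ &\quad = B_1 s_n^2 + B_2 t_n^2 + R_4, \end{aligned}$$

where $B_1$ and $B_2$ were defined in (2.3). We have used the fact that $\|\dot\psi(x_0)\|/\|\dot\eta(x_0)\| = 2\bar{f}(x_0)$
for $x_0 \in \mathcal{S}$ in the final step of this calculation. Once we have shown in Step 6 that

$$|R_4| = o(s_n^2), \tag{5.22}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta}$, this will complete Step 5 and hence the proof of Theorem 1.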
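
The final calculation in Step 5 rests on a Gaussian integral identity: for $c > 0$ and real $r_0$,
$\int_{-\infty}^{\infty} r\,[\Phi\{-2c(r - r_0)\} - 1_{\{r<0\}}]\,dr = 1/(8c^2) + r_0^2/2$. Taking $c = \|\dot\eta(x_0)\|$ and $r_0 = -a(x_0)t_n/(\|\dot\eta(x_0)\| s_n)$, the first term produces the contribution of order $s_n^2$ and the second the contribution of order $t_n^2$. The sketch below checks the identity numerically; it is an illustration under these stated assumptions, not part of the proof, and the function names are hypothetical.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z, computed stably via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def integrand(r: float, c: float, r0: float) -> float:
    """r * [Phi(-2c(r - r0)) - 1{r < 0}], written to avoid cancellation."""
    z = 2.0 * c * (r - r0)
    if r >= 0.0:
        return r * upper_tail(z)       # Phi(-z) = P(Z > z)
    return -r * upper_tail(-z)         # Phi(-z) - 1 = -Phi(z) = -P(Z > -z)


def simpson(f, a: float, b: float, n: int = 20000) -> float:
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numeric_integral(c: float, r0: float) -> float:
    """Integrate over [-L, 0] and [0, L]; the tails beyond L are negligible."""
    L = abs(r0) + 10.0 / c
    f = lambda r: integrand(r, c, r0)
    return simpson(f, -L, 0.0) + simpson(f, 0.0, L)


def closed_form(c: float, r0: float) -> float:
    return 1.0 / (8.0 * c * c) + 0.5 * r0 * r0
```

For instance, `numeric_integral(1.3, -0.4)` and `closed_form(1.3, -0.4)` should agree to several decimal places.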

Step 6: To show (5.5), which gives the Radon-Nikodym derivative of the restriction of the
distribution of $X_{(i)} - x$ to a small ball about the origin: Recall that $B_\delta(u) = \{y \in \mathbb{R}^d :
\|y - u\| \le \delta\}$, and that $\nu_d$ denotes $d$-dimensional Lebesgue measure. For a Borel subset $A$ of
$\mathbb{R}^d$, let $N(A) = \sum_{i=1}^n 1_{\{X_i \in A\}}$. It follows from the hypothesis (A.2) that for $x \in \mathcal{S}^{\epsilon_n}$ with $n$
sufficiently large, and for $i \le k_2$, the restriction of the distribution of $X_{(i)} - x$ to a small ball
about the origin is absolutely continuous with respect to $\nu_d$. Thus for $x \in \mathcal{S}^{\epsilon_n}$ with $n$ sufficiently
large, for $i \le k_2$, for $u$ with $\|u\|$ sufficiently small, and for $\delta < \|u\|$,
P{X(,) - X e Bsju)} 

^ r^^{^(^'5(^ + «)) = 1' NiB^Hhsix)) = ^ - 1, iV(5||„||+,(x)) =n-t} 
f{v) dv 



n 



' Bs{x+u) 

^ nf{x + u) ~ ^%|h1(1 - PMiT 



as $\delta \to 0$, by the Lebesgue differentiation theorem. For the other bound, write $A = B_{\|u\|+\delta}(x) \setminus
B_{\|u\|-\delta}(x)$ and observe that

P{X(,) - X G Bs{u)} 



Vd{Bs{u)) 
1 



< 



P{X(i) G B&{x + u)n N{A) = 1} + ¥{N{B5{x + m)) > 1 n N{A) > 2} 



n 



Bs{x+u) 



n — 1 



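as $\delta \to 0$. The result therefore follows by Folland (1999, Theorem 3.22).

To show (5.4), which bounds $R_1$: We have

$$\begin{aligned} R_1 &= \sum_{i=k_2+1}^n w_{ni} \bigl[ E\{\eta(X_{(i)})\} - \eta(x) \bigr] \\ &\quad + \sum_{i=1}^{k_2} w_{ni} \Bigl[ E\{\eta(X_{(i)})\} - \eta(x) - E\{(X_{(i)} - x)^T \dot\eta(x)\} - \frac12 E\{(X_{(i)} - x)^T \ddot\eta(x) (X_{(i)} - x)\} \Bigr] \\ &= R_{11} + R_{12}, \end{aligned}$$

say. Now $0 \le \eta \le 1$, so

$$\sup_{\mathbf{w}_n \in W_{n,\beta}} \sup_{x \in \mathcal{S}^{\epsilon_n}} \frac{|R_{11}|}{t_n} \le \sup_{\mathbf{w}_n \in W_{n,\beta}} \frac{n^{2/d} \sum_{i=k_2+1}^n w_{ni}}{\sum_{i=1}^n \alpha_i w_{ni}} \le \frac{1}{\log n}.$$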



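To handle $R_{12}$, observe that by a Taylor expansion, given $\epsilon > 0$, we can find $\delta > 0$ such that
for all sufficiently large $n$, all $x \in \mathcal{S}^{\epsilon_n}$ and all $\|y - x\| \le \delta$, we have

$$\bigl| \eta(y) - \eta(x) - (y - x)^T \dot\eta(x) - \tfrac12 (y - x)^T \ddot\eta(x) (y - x) \bigr| \le \epsilon \|y - x\|^2.$$

For $1 \le i \le k_2$, let $A_i = \{\|X_{(i)} - x\| \le \delta\}$. Further, let $D_1 = \sup_{x \in \mathcal{S}^{\epsilon_n}} \|\dot\eta(x)\|$ and let $D_2 =
\sup_{x \in \mathcal{S}^{\epsilon_n}} \lambda_{\max}\{\ddot\eta(x)\}$, where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix. Then for large $n$
and $x \in \mathcal{S}^{\epsilon_n}$,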



k2 



\Ri2\ <eY,WmmXii) - xf UJ + 2F{A'r) + 2DiE{\\X^i) - x\\1a^J + 2D2E{||X(,) - xf 1^.}. 



i=l 



(5.23) 

We can apply a very similar argument to that employed in Step 1 to deduce that uniformly
for $1 \le i \le k_2$,

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} E(\|X_{(i)} - x\|^2 1_{A_i}) = O\{(i/n)^{2/d}\}. \tag{5.24}$$



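Now, for $\delta \le 1$,

$$\begin{aligned} E(\|X_{(i)} - x\|^2 1_{A_i^c}) &= \delta^2 P(\|X_{(i)} - x\| > \delta) + \int_{\delta^2}^{\infty} P(\|X_{(i)} - x\| > t^{1/2}) \, dt \\ &\le P\{\mathrm{Bin}(n, p_\delta) < i\} + \int_{\delta^2}^{\infty} P\{\mathrm{Bin}(n, p_{t^{1/2}}) < i\} \, dt. \end{aligned} \tag{5.25}$$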



For $\delta > 0$ sufficiently small, there exists $c_2 > 0$ such that $n p_\delta - k_2 \ge c_2 n \delta^d$. So by Hoeffding's
inequality, for any $t_0 > \delta^2$,

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \sup_{k_1 \le i \le k_2} \Bigl[ P\{\mathrm{Bin}(n, p_\delta) < i\} + \int_{\delta^2}^{t_0} P\{\mathrm{Bin}(n, p_{t^{1/2}}) < i\} \, dt \Bigr] \le (1 + t_0)\, e^{-2 c_2^2 n \delta^{2d}} = O(n^{-M}), \tag{5.26}$$

for every $M > 0$. Moreover, using the moment bound in (A.3),

$$1 - p_t = P(\{x + u : \|u\| > t\}) = O(t^{-\rho}) \tag{5.27}$$

as $t \to \infty$, uniformly for $x \in \mathcal{S}^{\epsilon_n}$. Therefore we can apply Bennett's inequality (Shorack and Wellner,
1986, p. 440) to show that there exist $c_3, c_4 > 0$ such that for sufficiently large $n$ and $t_0$ and all
$t > t_0$,



sup sup P{Bin(n,pji/2) < i} < exp 

xg<S'=" ki<i<k2 



|log(l + - 



1 



We deduce from ([E25D, I^M) and I K28^ that 



sup sup E{||X(i) - xll'l^c} = 0{n-^) 



< (1 +C4t''/')-'==*"/l 

(5.28) 
(5.29) 



for all $M > 0$. This result, combined with (5.24) and Markov's inequality applied to the two
central terms in (5.23), proves (5.4) as required.

To show (5.20), which bounds $R_2$: Observe that by Step 1 and Step 2, there exist constants
$c_5, C_2 > 0$ such that

|l/2 - l2n{Xo)\ ^ C5\t\ 



inf inf 

xo£S C2tr,<\t\<en Crn[Xo) 
uniformly for $\mathbf{w}_n \in W_{n,\beta}$. Hence,

m < ' "^"^ / / \t\mxo)\\ dtdvof-\xo) 

JSJ\t\<C2tn 



■^n Js Jc2t„<\t\<tn 1 + ^il^P/Sn 



uniformly for $\mathbf{w}_n \in W_{n,\beta}$, as required.

To show (5.21), which bounds $R_3$: Let

$$r_{x_0} = \frac{-a(x_0)\, t_n}{\|\dot\eta(x_0)\|\, s_n}.$$

Using the results of Step 1 and Step 2, given $\epsilon \in (0, \inf_{x_0 \in \mathcal{S}} \|\dot\eta(x_0)\|)$ sufficiently small, for
large $n$ we have that for all $\mathbf{w}_n \in W_{n,\beta}$, all $x_0 \in \mathcal{S}$ and all $r \in [-\epsilon_n/s_n, \epsilon_n/s_n]$ that



It follows that for large n, 



< e^i\r\+tn/sn). 



< 



1 if |r - r^^J < etn/sn 

+^n./sn)0(||?7(a;o)||k-^xol) if e^n/sn < \r\ < en/sn- 



We deduce that for large n, 



dt 



00 



< e^n / \r\dr + sl e^{\r\ + tn/ Sn)(f){\\r]{xo)\\\r - r^^l) dr 

\r-rxo\<iitn/s„ J -co 



<<sl + tl). 



This allows us to conclude (5.21).

To show (5.22), which bounds $R_4$: We have



\Ra\ = sI / / |r|[<l>{-2||?)(xo)||(r-r,J}-l|,<o}]drdVof-i(xo) 

^5 J \r\>tn/ Sn 



<2si / |r|$(-||77(a;o)||r)rfrciVol'^-^(xo) = o(s^) 

^5 ^ r>en/Sn 
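uniformly for $\mathbf{w}_n \in W_{n,\beta}$, as required. □

Proof of the fact that conditions (A.1)-(A.4) imply the margin condition (2.1)

For the upper bound, recall from (5.16) that by the mean value theorem, for sufficiently small
$\epsilon > 0$,

$$\inf_{x \in \mathcal{R} \setminus \mathcal{S}^{\epsilon}} |\eta(x) - 1/2| \ge c_* \epsilon,$$

where we may take $c_* = \inf_{x \in U_0} \|\dot\eta(x)\|$, which is positive. By shrinking $U_0$ if necessary, we may
assume that $D^* = \sup_{x \in U_0} \bar{f}(x) < \infty$, and it follows that for small $\epsilon > 0$,

$$P(|\eta(X) - 1/2| \le \epsilon \cap X \in \mathcal{R}) \le P(\mathcal{S}^{\epsilon/c_*}) \le D^* \nu_d(\mathcal{S}^{\epsilon/c_*}) \le C\epsilon,$$

where $C < \infty$, using Weyl's tube formula (Gray, 2004).

For the lower bound, we construct a tube similar to $\mathcal{S}^{\epsilon}$, but contained in $\mathcal{R}$. To do this, let
$\mathcal{S}_{\epsilon\epsilon} = \{x \in \mathcal{S} : \mathrm{dist}(x, \partial\mathcal{S}) > \epsilon\}$ and let

$$\mathcal{S}_{\epsilon} = \Bigl\{ x_0 + t \frac{\dot\eta(x_0)}{\|\dot\eta(x_0)\|} : x_0 \in \mathcal{S}_{\epsilon\epsilon},\ |t| < \epsilon \Bigr\}.$$

Further, let $C^* = \sup_{x \in U_0} \|\dot\eta(x)\|$, which is finite. Again by the mean value theorem, for
sufficiently small $\epsilon > 0$,

$$\sup_{x \in \mathcal{S}_{\epsilon}} |\eta(x) - 1/2| \le C^* \epsilon.$$

Thus, letting $d_* = \inf_{x \in U_0} \bar{f}(x) > 0$, for sufficiently small $\epsilon > 0$,

$$P(|\eta(X) - 1/2| \le \epsilon \cap X \in \mathcal{R}) \ge P(\mathcal{S}_{\epsilon/C^*}) \ge d_* \nu_d(\mathcal{S}_{\epsilon/C^*}) \ge c\epsilon,$$

where $c > 0$, again using Weyl's tube formula. □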
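Proof of Theorem 2

Consider any vector of non-negative weights $\mathbf{w}_n^{**} = (w_{ni}^{**})_{i=1}^n$ that minimises the function $\gamma_n(\cdot)$
defined in the statement of Theorem 1. Since $s_n^2$ is symmetric in $w_{n1}, \ldots, w_{nn}$, while $\alpha_i$ is
increasing in $i$, we see that $(w_{ni}^{**})_{i=1}^n$ is decreasing in $i$. We let $k^{**} = \max\{i : w_{ni}^{**} > 0\}$. Now
form the Lagrangian

$$L(\mathbf{w}_n, \lambda) = \frac12 B_1 s_n^2 + \frac12 B_2 t_n^2 + \lambda \Bigl( \sum_{i=1}^n w_{ni} - 1 \Bigr).$$

Then for some $\lambda^{**}$,

$$\frac{\partial L}{\partial w_{ni}} \Big|_{(\mathbf{w}_n^{**}, \lambda^{**})} = B_1 w_{ni}^{**} + \frac{B_2 \alpha_i}{n^{4/d}} \sum_{j=1}^{k^{**}} \alpha_j w_{nj}^{**} + \lambda^{**} = 0 \tag{5.30}$$

for $i = 1, \ldots, k^{**}$. By summing (5.30) from $i = 1, \ldots, k^{**}$, and then multiplying (5.30) by $\alpha_i$
and again summing from $i = 1, \ldots, k^{**}$, we obtain two linear equations in $\sum_{j=1}^{k^{**}} \alpha_j w_{nj}^{**}$ and $\lambda^{**}$,
which can be solved and substituted back into (5.30) to yield

$$w_{ni}^{**} = \frac{1}{k^{**}} + \frac{(k^{**})^{4/d} - (k^{**})^{2/d} \alpha_i}{\sum_{j=1}^{k^{**}} \alpha_j^2 + n^{4/d} B_1/B_2 - (k^{**})^{1+4/d}}$$

for $i = 1, \ldots, k^{**}$. In particular, $\mathbf{w}_n^{**}$ is the unique minimiser of $\gamma_n(\cdot)$. The weight vector $\mathbf{w}_n^{*}$
is asymptotically equivalent to $\mathbf{w}_n^{**}$ in the following sense: elementary calculations reveal that
$k^{**} = k^{*}\{1 + O((k^{*})^{-1})\}$, and moreover that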
2^'^^<* = - ,' 4 + )} and }_^a,< = ^ , ' {^ + 0{{k*) ^)}, 

1=1 i=i 

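It follows immediately that

$$\frac{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n^{**}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n^{*}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})} = \{1 + o(1)\} \frac{\gamma_n(\mathbf{w}_n^{**})}{\gamma_n(\mathbf{w}_n^{*})} \to 1,$$

and therefore that (2.5) holds.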
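
The Lagrangian calculation above can be illustrated numerically. The sketch below assumes $\gamma_n(\mathbf{w}_n) = B_1 \sum_i w_{ni}^2 + B_2 (n^{-2/d}\sum_i \alpha_i w_{ni})^2$ with $\alpha_i = i^{1+2/d} - (i-1)^{1+2/d}$, fixes the support size $k$ in advance (in the proof it is determined by the minimiser itself), and uses the stationary weights $k^{-1} + (k^{4/d} - k^{2/d}\alpha_i)/(\sum_j \alpha_j^2 + n^{4/d}B_1/B_2 - k^{1+4/d})$ obtained by solving the two linear equations. The constants `B1`, `B2` and all function names are hypothetical; if $k$ is chosen too large, some of these weights become negative and no longer represent the constrained minimiser over non-negative weights.

```python
import math


def alpha(i: int, d: int) -> float:
    """alpha_i = i^(1 + 2/d) - (i - 1)^(1 + 2/d)."""
    p = 1 + 2 / d
    return i ** p - (i - 1) ** p


def gamma_n(w: list, n: int, d: int, B1: float, B2: float) -> float:
    """Assumed objective B1 * sum w_i^2 + B2 * (n^(-2/d) * sum alpha_i w_i)^2."""
    s2 = math.fsum(x * x for x in w)
    t = math.fsum(alpha(i + 1, d) * x for i, x in enumerate(w)) / n ** (2 / d)
    return B1 * s2 + B2 * t * t


def stationary_weights(k: int, n: int, d: int, B1: float, B2: float) -> list:
    """Solve the stationarity equations on a fixed support {1, ..., k}."""
    a = [alpha(i, d) for i in range(1, k + 1)]
    K = math.fsum(a)                      # equals k^(1 + 2/d) by telescoping
    A2 = math.fsum(x * x for x in a)
    denom = A2 + B1 * n ** (4 / d) / B2 - K * K / k
    return [1 / k + (K / k) * (K / k - ai) / denom for ai in a]


if __name__ == "__main__":
    n, d, k, B1, B2 = 1000, 2, 50, 1.0, 1.0  # illustrative values only
    w = stationary_weights(k, n, d, B1, B2)
    print("sum of weights:", math.fsum(w))
    print("gamma_n at stationary weights:", gamma_n(w, n, d, B1, B2))
    print("gamma_n at uniform weights   :", gamma_n([1 / k] * k, n, d, B1, B2))
```

Since the objective is a strictly convex quadratic, any perturbation that keeps the weights summing to one should increase it, which gives a simple check of the closed form.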

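Arguing similarly to the above, we see that the conditions $\sum_{i=1}^n w_{ni}^2 / \sum_{i=1}^n (w_{ni}^{*})^2 \to 1$ and
$\sum_{i=1}^n \alpha_i w_{ni} / \sum_{i=1}^n \alpha_i w_{ni}^{*} \to 1$, or equivalently (2.6), are sufficient for (2.5) to hold. To see the
necessity of these conditions, suppose for now that for some small $\beta > 0$, the weight vector
$\mathbf{w}_n \in W_{n,\beta}$ satisfies

$$\frac{\sum_{i=1}^n w_{ni}^2}{\sum_{i=1}^n (w_{ni}^{*})^2} \to r \in [0, 1). \tag{5.31}$$

Then, by almost the same Lagrangian calculation as that above, we have

$$\liminf_{n \to \infty} \frac{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\tilde{\mathbf{w}}_n}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})} \ge 1,$$

where $\tilde{\mathbf{w}}_n = (\tilde{w}_{ni})_{i=1}^n$ is given by

$$\tilde{w}_{ni} = \begin{cases} \dfrac{1}{\tilde{k}} \Bigl( 1 + \dfrac{d}{2} - \dfrac{d \alpha_i}{2 \tilde{k}^{2/d}} \Bigr) & \text{for } i = 1, \ldots, \tilde{k}, \\[1ex] 0 & \text{for } i = \tilde{k} + 1, \ldots, n, \end{cases}$$

and where $\tilde{k}/k^{*} \to 1/r$. It follows that for small $\beta > 0$, and for any $\mathbf{w}_n \in W_{n,\beta}$ satisfying
$\sum_{i=1}^n w_{ni}^2 / \sum_{i=1}^n (w_{ni}^{*})^2 \le r$, we have

$$\liminf_{n \to \infty} \frac{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n^{*}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})} \ge \lim_{n \to \infty} \frac{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\tilde{\mathbf{w}}_n}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})}{R_{\mathcal{R}}(\hat{C}^{\mathrm{wnn}}_{n,\mathbf{w}_n^{*}}) - R_{\mathcal{R}}(C^{\mathrm{Bayes}})} = \frac{1}{d+4} \Bigl( \frac{d}{r^{4/d}} + 4r \Bigr) > 1. \tag{5.32}$$

A very similar argument shows that if

$$\frac{\sum_{i=1}^n \alpha_i w_{ni}}{\sum_{i=1}^n \alpha_i w_{ni}^{*}} \to a \in [0, 1), \tag{5.33}$$

then the conclusion of (5.32) also holds. But if (2.5) holds and it is not the case that both
$\sum_{i=1}^n w_{ni}^2 / \sum_{i=1}^n (w_{ni}^{*})^2 \to 1$ and $\sum_{i=1}^n \alpha_i w_{ni} / \sum_{i=1}^n \alpha_i w_{ni}^{*} \to 1$, then either (5.31) or (5.33) would
have to hold on a subsequence. But then we see from (5.32) that (2.5) cannot hold, and this
contradiction means that the conditions (2.6) are necessary for (2.5).

The final part of the theorem, deriving (2.7), is an elementary calculation and is omitted.

□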

Proof of Corollary 4

This corollary follows from Theorem 1 and the following facts:



^ b,with r(2 + 2/ci) , 1 , ,\\ ^ 

i=l ^ \ y /J I— I 

Z^"*^ni = — ^^7^^ — \ Km j] ^ 



b,with\2 



b,w/oN2 _ 



1 + 



.Geo 



^^^^±^{l + 0(g)} and B^.^)^ = f {1 + 0(g)}, 



where the error terms are, in each case, uniform for n~^^~^^ < g < . 
□

Proof of Theorem 6

Let $t_n^{(r)} = n^{-2r/d} \sum_{i=1}^n \alpha_i^{(r)} w_{ni}$. We only need to show that

$$\sup_{x \in \mathcal{S}^{\epsilon_n}} \bigl| \mu_n(x) - \eta(x) - a^{(r)}(x)\, t_n^{(r)} \bigr| = o(t_n^{(r)}), \tag{5.34}$$

uniformly for $\mathbf{w}_n \in W_{n,\beta,r}$, because the rest of the proof is virtually identical to that of
Theorem 1. The appropriate analogue of (5.8) is

['i){x)^un{f{x + u) — f{x)} + i){x)u nf {x + M)]P{Bin(n — l,P||?i||) = i — 1} du 



k2 
i=l 



\\<Sn 



nAwr,.i 



^(2r-2)!^ ^ 

i=ki j=l s:|s|=r— 1 



u\\<5n 



{rij{x)u^ju''fsj{x) + lrijj{x)u^ju'fs{x)} 

X P{Bin(n - l,p||„||) < i}c/M, (5.35) 



uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $\mathbf{w}_n \in W_{n,\beta,r}$. Combining (5.6), (5.7), (5.35), (5.9) and (5.10), we
have



fc2 -j^ fc2 

^u;™E{(X(,) - xffiix)} + 7;Y1 ^™E{(X(,) - xfii{x){X^i) - x)} 



i=l 



i=ki 



nAwr, 



{1+0(1)}^——-^^ E 

i=l ^ ^' " j=l s:|s|=r-l 

aM(x)e)+o(t(:)), 



{Vj{x)fsj{x) + |r/jj(x)/,(x)} 



t):|li;||<l 



(5.36) 
uniformly for $x \in \mathcal{S}^{\epsilon_n}$ and $\mathbf{w}_n \in W_{n,\beta,r}$. Combining (5.3), the analogue of (5.4) and (5.36), this
proves (5.34). □